Tag Archives: hash

Live Mayweather v McGregor Streams Will Thrive On Torrents Tonight

Post Syndicated from Andy original https://torrentfreak.com/live-mayweather-v-mcgregor-streams-will-thrive-on-torrents-tonight-170826/

Tonight, August 26, at the T-Mobile Arena in Las Vegas, Floyd Mayweather Jr. will finally meet UFC lightweight champion Conor McGregor in what is being billed as the biggest fight in boxing history.

Although tickets for inside the arena are still available for those with a lot of money to burn, most fans will be viewing on a screen of some kind, whether that’s in a cinema, sports bar, or at home in front of a TV.

The fight will be available on Showtime in the United States but the promoters also say they’ve done their best to make it accessible to millions of people in dozens of countries, with varying price tags dependent on region. Nevertheless, due to generally high prices, it’s likely that untold thousands around the world will attempt to watch the fight without paying.

That will definitely be possible. Although Showtime has won a pre-emptive injunction to stop some sites offering the fight, many hundreds of others are likely to fill in the gaps, offering generally lower-quality streams to the eager masses. Whether all of these sites will be able to cope with what could be unprecedented demand remains to be seen, but there is one method that will thrive under the pressure.

Torrent technology is best known for offering content after it’s aired, whether that’s the latest episode of Game of Thrones or indeed a recording of the big fight scheduled for the weekend. However, what most ‘point-and-click’ file-sharers won’t know is that there’s a torrent-based technology that offers live sporting events week in, week out.

Without going into too many technical details, AceStream / Ace Player HD is a torrent engine built into the ever-popular VLC media player. It’s available on Windows, Android and Linux, costs nothing to install, and is incredibly easy to use.

Where regular torrent clients handle both .torrent files and magnet links, AceStream instead relies on an AceStream Content ID to find streams to play. This ID is a hash value (similar to one seen in magnet links, but prefaced with ‘acestream://’) which relates to the stream users want to view.

Once found, these can be copied to the user’s clipboard and pasted into the ‘Open Ace Stream Content ID’ section of the player’s file menu. Click ‘play’ and it’s done – it really is that simple.

AceStream is simplicity itself

Of course, any kind of content – both authorized and unauthorized – can be streamed and shared using AceStream and there are hundreds of live channels available, some in very high quality, 24/7. Inevitably, however, there’s quite an emphasis on premium content from sports broadcasters around the world, with fresh links to content shared on a daily basis.

The screenshot below shows a typical AceStream Content ID indexing site, with channels on the left, AceStream Content IDs in the center, plus language and then stream speed on the far right. (Note: TF has redacted the links since many will still be live at time of publication)

A typical AceStream Content ID listing

While streams of most major TV channels are relatively easy to find, specialist channels showing PPV events are a little bit more difficult to discover. For those who know where to look, however, the big fight will be only a cut-and-paste away and in much better quality than that found on most web-based streaming portals.

All that being said, for torrent enthusiasts the magic lies in the ability of the technology to adapt to surging demand. While websites and streams wilt under the load Saturday night, it’s likely that AceStream streams will thrive under the pressure, with viewers (downloaders/streamers) also becoming distributors (uploaders) to others watching the event unfold.

With this in mind, it’s worth noting that while AceStream is efficient and resilient, using it to watch infringing content is illegal in most regions, since simultaneous uploading also takes place. Still, that’s unlikely to frighten away enthusiasts, who will already be aware of the risks and behind a VPN.

Ace Streams do have an Achilles heel though. Unlike a regular torrent swarm, where the initial seeder can disappear once a full copy of the movie or TV show is distributed around other peers, AceStreams are completely reliant on the initial stream seeder at all times. If he or she disappears, the live stream dies and it is all over. For this reason, people looking to stream often have a couple of extra stream hashes standing by.

But for big fans (who also have the money to spend, of course), the decision to pirate rather than pay is one not to be taken lightly. The fight will be a huge spectacle that will probably go down in history as the biggest combat sports event of all time. If streams go down early, that moment will be gone forever, so forget telling your kids about the time you watched McGregor knock out Mayweather in Round Two.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum

Post Syndicated from Dylan Tong original https://aws.amazon.com/blogs/big-data/from-data-lake-to-data-warehouse-enhancing-customer-360-with-amazon-redshift-spectrum/

Achieving a 360° view of your customer has become increasingly challenging as companies embrace omni-channel strategies, engaging customers across websites, mobile, call centers, social media, physical sites, and beyond. The promise of a web where online and physical worlds blend makes understanding your customers more challenging, but also more important. Businesses that are successful in this medium have a significant competitive advantage.

The big data challenge requires the management of data at high velocity and volume. Many customers have identified Amazon S3 as a great data lake solution that removes the complexities of managing a highly durable, fault-tolerant data lake infrastructure at scale, and does so economically.

AWS data services substantially lessen the heavy lifting of adopting technologies, allowing you to spend more time on what matters most—gaining a better understanding of customers to elevate your business. In this post, I show how a recent Amazon Redshift innovation, Redshift Spectrum, can enhance a customer 360 initiative.

Customer 360 solution

A successful customer 360 view benefits from using a variety of technologies to deliver different forms of insights. These could range from real-time analysis of streaming data from wearable devices and mobile interactions to historical analysis that requires interactive, on-demand queries on billions of transactions. In some cases, insights can only be inferred through AI techniques such as deep learning. Finally, the value of your customer data and insights can’t be fully realized until it is operationalized at scale—readily accessible by fleets of applications. Companies are leveraging AWS for the breadth of services that cover these domains, to drive their data strategy.

A number of AWS customers stream data from various sources into an S3 data lake through Amazon Kinesis. They use Kinesis and technologies in the Hadoop ecosystem, like Spark running on Amazon EMR, to enrich this data. High-value data is loaded into an Amazon Redshift data warehouse, which allows users to analyze and interact with data through a choice of client tools. Redshift Spectrum expands on this analytics platform by enabling Amazon Redshift to blend and analyze data beyond the data warehouse and across a data lake.

The following diagram illustrates the workflow for such a solution.

This solution delivers value by:

  • Reducing complexity and time to value for deeper insights. For instance, an existing data model in Amazon Redshift may provide insights across dimensions such as customer, geography, time, and product on metrics from sales and financial systems. Down the road, you may gain access to streaming data sources like customer-care call logs and website activity that you want to blend in with the sales data on the same dimensions to understand how web and call center experiences may be correlated with sales performance. Redshift Spectrum can join these dimensions in Amazon Redshift with data in S3 to allow you to quickly gain new insights, and avoid the slow and more expensive alternative of fully integrating these sources with your data warehouse.
  • Providing an additional avenue for optimizing costs and performance. In cases like call logs and clickstream data where volumes could be many TBs to PBs, storing the data exclusively in S3 yields significant cost savings. Interactive analysis on massive datasets may now be economically viable in cases where data was previously analyzed periodically through static reports generated by inexpensive batch processes. In some cases, you can improve the user experience while simultaneously lowering costs. Spectrum is powered by a large-scale infrastructure external to your Amazon Redshift cluster, and excels at scanning and aggregating large volumes of data. For instance, your analysts may be performing data discovery on customer interactions across millions of consumers over years of data across various channels. On this large dataset, certain queries could be slow if you didn’t have a large Amazon Redshift cluster. Alternatively, you could use Redshift Spectrum to achieve a better user experience with a smaller cluster.

Proof of concept walkthrough

To make evaluation easier for you, I’ve conducted a Redshift Spectrum proof-of-concept (PoC) for the customer 360 use case. For those who want to replicate the PoC, the instructions, AWS CloudFormation templates, and public data sets are available in the GitHub repository.

The remainder of this post is a journey through the project, observing best practices in action, and learning how you can achieve business value. The walkthrough involves:

  • An analysis of performance data from the PoC environment involving queries that demonstrate blending and analysis of data across Amazon Redshift and S3. Observe that great results are achievable at scale.
  • Guidance by example on query tuning, design, and data preparation to illustrate the optimization process. This includes tuning a query that combines clickstream data in S3 with customer and time dimensions in Amazon Redshift, and aggregates ~1.9 B out of 3.7 B+ records in under 10 seconds with a small cluster!
  • Guidance and measurements to help you decide between two options: accessing and analyzing data exclusively in Amazon Redshift, or using Redshift Spectrum to access data left in S3.

Stream ingestion and enrichment

The focus of this post isn’t stream ingestion and enrichment on Kinesis and EMR, but be mindful of performance best practices on S3 to ensure good streaming and query performance:

  • Use random object keys: The data files provided for this project are prefixed with SHA-256 hashes to prevent hot partitions. This is important to ensure optimal request rates to support PUT requests from the incoming stream, in addition to certain queries from large Amazon Redshift clusters that could send a large number of parallel GET requests.
  • Micro-batch your data stream: S3 isn’t optimized for small random write workloads. Your datasets should be micro-batched into large files. For instance, the “parquet-1” dataset provided batches >7 million records per file. The optimal file size for Redshift Spectrum is usually in the 100 MB to 1 GB range.

If you have an edge case that may pose scalability challenges, AWS would love to hear about it. For further guidance, talk to your solutions architect.

Environment

The project consists of the following environment:

  • Amazon Redshift cluster: 4 X dc1.large
  • Data:
    • Time and customer dimension tables are stored on all Amazon Redshift nodes (ALL distribution style):
      • The data originates from the DWDATE and CUSTOMER tables in the Star Schema Benchmark
      • The customer table contains attributes for 3 million customers.
      • The time data is at the day-level granularity, and spans 7 years, from the start of 1992 to the end of 1998.
    • The clickstream data is stored in an S3 bucket, and serves as a fact table.
      • Various copies of this dataset in CSV and Parquet format have been provided, for reasons to be discussed later.
      • The data is a modified version of the uservisits dataset from AMPLab’s Big Data Benchmark, which was generated by Intel’s Hadoop benchmark tools.
      • Changes were minimal, so that existing test harnesses for this test can be adapted:
        • Increased the 751,754,869-row dataset 5X to 3,758,774,345 rows.
        • Added surrogate keys to support joins with customer and time dimensions. These keys were distributed evenly across the entire dataset to represent user visits from six customers over seven years.
        • Values for the visitDate column were replaced to align with the 7-year timeframe, and the added time surrogate key.

Queries across the data lake and data warehouse 

Imagine a scenario where a business analyst plans to analyze clickstream metrics like ad revenue over time and by customer, market segment and more. The example below is a query that achieves this effect: 
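
In outline, such a query looks something like the following sketch. Table and column names follow the fragments quoted later in this post; the filter values (three customers, the last three months of 1998) and the assumption that dwdate exposes a prettyMonthYear label per d_yearmonthnum are illustrative only.

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue) AS totalRevenue
FROM clickstream.uservisits_csv10 AS uv
JOIN customer AS c
    ON c.c_custkey = uv.custKey
JOIN (SELECT DISTINCT d_yearmonthnum, prettyMonthYear FROM dwdate) AS t
    ON uv.yearMonthKey = t.d_yearmonthnum
WHERE c.c_custkey <= 3
    AND uv.yearMonthKey >= 199810
GROUP BY c.c_name, c.c_mktsegment, t.prettyMonthYear, uv.yearMonthKey
ORDER BY c.c_name, c.c_mktsegment, uv.yearMonthKey ASC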

The query retrieves clickstream data from S3 and joins it with the time and customer dimension tables in Amazon Redshift. It returns the total ad revenue for three customers over the last three months, along with information on their respective market segments.

Unfortunately, this query takes around three minutes to run, and doesn’t enable the interactive experience that you want. However, there are a number of performance optimizations that you can implement to achieve the desired performance.

Performance analysis

Two key utilities provide visibility into Redshift Spectrum:

  • EXPLAIN
    Provides the query execution plan, which includes info around what processing is pushed down to Redshift Spectrum. Steps in the plan that include the prefix S3 are executed on Redshift Spectrum. For instance, the plan for the previous query has the step “S3 Seq Scan clickstream.uservisits_csv10”, indicating that Redshift Spectrum performs a scan on S3 as part of the query execution.
  • SVL_S3QUERY_SUMMARY
    Statistics for Redshift Spectrum queries are stored in this table. While the execution plan presents cost estimates, this table stores actual statistics for past query runs.

You can get the statistics of your last query by inspecting the SVL_S3QUERY_SUMMARY table with the condition (query = pg_last_query_id()). Inspecting the previous query reveals that the entire dataset of nearly 3.8 billion rows was scanned to retrieve less than 66.3 million rows. Improving scan selectivity in your query could yield substantial performance improvements.
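
For instance, here is a quick sketch that pulls a few of the more useful columns for the last query run in your session (the column list is a subset of what the view exposes):

SELECT query, segment, elapsed, s3_scanned_rows, s3_scanned_bytes,
       s3query_returned_rows, s3query_returned_bytes, files
FROM svl_s3query_summary
WHERE query = pg_last_query_id()
ORDER BY segment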

Partitioning

Partitioning is a key means to improving scan efficiency. In your environment, the data and tables have already been organized, and configured to support partitions. For more information, see the PoC project setup instructions. The clickstream table was defined as:

CREATE EXTERNAL TABLE clickstream.uservisits_csv10
…
PARTITIONED BY(customer int4, visitYearMonth int4)

The entire 3.8 billion-row dataset is organized as a collection of large files where each file contains data exclusive to a particular customer and month in a year. This allows you to partition your data into logical subsets by customer and year/month. With partitions, the query engine can target a subset of files:

  • Only for specific customers
  • Only data for specific months
  • A combination of specific customers and year/months
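
In this environment the partitions have already been registered for you as part of the setup; for reference, a single registration on the external table looks roughly like this (the S3 path is a placeholder):

ALTER TABLE clickstream.uservisits_csv10
ADD PARTITION (customer=1, visitYearMonth=199201)
LOCATION 's3://your-clickstream-bucket/uservisits_csv10/customer=1/visitYearMonth=199201/'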

You can use partitions in your queries. Instead of joining your customer data on the surrogate customer key (that is, c.c_custkey = uv.custKey), use the partition key “customer”:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ORDER BY c.c_name, c.c_mktsegment, uv.yearMonthKey  ASC

This query should run approximately twice as fast as the previous query. If you look at the statistics for this query in SVL_S3QUERY_SUMMARY, you see that only half the dataset was scanned. This is expected because your query is on three out of six customers on an evenly distributed dataset. However, the scan is still inefficient, and you can benefit from using your year/month partition key as well:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ON uv.visitYearMonth = t.d_yearmonthnum
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

All joins between the tables are now using partitions. Upon reviewing the statistics for this query, you should observe that Redshift Spectrum scans and returns the exact number of rows, 66,270,117. If you run this query a few times, you should see execution time in the range of 8 seconds, which is a 22.5X improvement on your original query!

Predicate pushdown and storage optimizations 

Previously, I mentioned that Redshift Spectrum performs processing through large-scale infrastructure external to your Amazon Redshift cluster. It is optimized for performing large scans and aggregations on S3. In fact, Redshift Spectrum may even outperform a medium-sized Amazon Redshift cluster on these types of workloads with the proper optimizations. There are two important variables to consider for optimizing large scans and aggregations:

  • File size and count. As a general rule, use files 100 MB-1 GB in size, as Redshift Spectrum and S3 are optimized for reading objects of this size. However, the number of files a query operates on is directly correlated with the parallelism achievable by that query. There is an inverse relationship between file size and count: the bigger the files, the fewer files there are for the same dataset. Consequently, there is a trade-off between optimizing for object read performance and the amount of parallelism achievable on a particular query. Large files are best for large scans, as the query likely operates on a sufficiently large number of files anyway. For queries that are more selective, and which therefore touch fewer files, you may find that smaller files allow for more parallelism.
  • Data format. Redshift Spectrum supports various data formats. Columnar formats like Parquet can sometimes lead to substantial performance benefits by providing compression and more efficient I/O for certain workloads. Generally, formats like Parquet should be used for query workloads involving large scans and high attribute selectivity. Again, there are trade-offs, as formats like Parquet require more compute power to process than plaintext. For queries on smaller subsets of data, the I/O efficiency benefit of Parquet is diminished; at some point, Parquet may perform the same as or slower than plaintext. Latency, compression rates, and the trade-off between user experience and cost should drive your decision.

To help illustrate how Redshift Spectrum performs on these large aggregation workloads, run a basic query that aggregates the entire ~3.7 billion record dataset on Redshift Spectrum, and compare that with running the query exclusively on Amazon Redshift:

SELECT uv.custKey, COUNT(uv.custKey)
FROM <your clickstream table> as uv
GROUP BY uv.custKey
ORDER BY uv.custKey ASC

For the Amazon Redshift test case, the clickstream data is loaded and distributed evenly across all nodes (EVEN distribution style), with optimal column compression encodings prescribed by Amazon Redshift’s ANALYZE COMPRESSION command.

The Redshift Spectrum test case uses a Parquet data format, with each file containing all the data for a particular customer in a month. This results in files mostly in the range of 220-280 MB, which is effectively the largest file size possible for this partitioning scheme. If you run tests with the other datasets provided, you see that this data format and size is optimal and outperforms the others by ~60X.

Performance differences will vary depending on the scenario. The important takeaway is to understand the testing strategy and the workload characteristics where Redshift Spectrum is likely to yield performance benefits. 

The following chart compares the query execution time for the two scenarios. The results indicate that you would have to pay for 12 X DC1.Large nodes to get performance comparable to using a small Amazon Redshift cluster that leverages Redshift Spectrum. 

Chart showing simple aggregation on ~3.7 billion records

So you’ve validated that Spectrum excels at performing large aggregations. Could you benefit by pushing more work down to Redshift Spectrum in your original query? It turns out that you can, by making the following modification:

The clickstream data is stored at a day-level granularity for each customer while your query rolls up the data to the month level per customer. In the earlier query that uses the year/month partition key, you optimized the query so that it only scans and retrieves the data required, but the day-level data is still sent back to your Amazon Redshift cluster for joining and aggregation. The query shown here pushes aggregation work down to Redshift Spectrum, as indicated by the query plan:
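
In outline, the reshaped query pre-aggregates the clickstream data in a derived table before the joins, along the lines of the following sketch (the Parquet table placeholder and the three-month filter values are assumptions; the seven-year variant shown later in the post has the same shape):

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, uv.totalRevenue
FROM (
    SELECT customer, visitYearMonth, SUM(adRevenue) AS totalRevenue
    FROM <your parquet clickstream table>
    WHERE customer <= 3 AND visitYearMonth >= 199810
    GROUP BY customer, visitYearMonth
) AS uv
JOIN customer AS c
    ON c.c_custkey = uv.customer
JOIN (SELECT DISTINCT d_yearmonthnum, prettyMonthYear FROM dwdate WHERE d_yearmonthnum >= 199810) AS t
    ON uv.visitYearMonth = t.d_yearmonthnum
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC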

In this query, Redshift Spectrum aggregates the clickstream data to the month level before it is returned to the Amazon Redshift cluster and joined with the dimension tables. This query should complete in about 4 seconds, which is roughly twice as fast as only using the partition key. The speed increase is evident upon reviewing the SVL_S3QUERY_SUMMARY table:

  • Bytes scanned is 21.6X less because of the Parquet data format.
  • Only 90 records are returned to the Amazon Redshift cluster as a result of the push-down, instead of ~66.2 million, leading to substantially less join overhead, and about 530 MB less data sent back to your cluster.
  • No adverse change in average parallelism.

Assessing the value of Amazon Redshift vs. Redshift Spectrum

At this point, you might be asking yourself, why would I ever not use Redshift Spectrum? Well, you still get additional value for your money by loading data into Amazon Redshift, and querying in Amazon Redshift vs. querying S3.

In fact, it turns out that the last version of our query runs even faster when executed exclusively in native Amazon Redshift, as shown in the following chart:

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 3 months of data

As a general rule, queries that aren’t dominated by I/O and which involve multiple joins are better optimized in native Amazon Redshift. For instance, the performance difference between running the partition key query entirely in Amazon Redshift versus with Redshift Spectrum is twice as large as that of the pushdown aggregation query, partly because the former case benefits more from better join performance.

Furthermore, the variability in latency in native Amazon Redshift is lower. For use cases where you have tight performance SLAs on queries, you may want to consider using Amazon Redshift exclusively to support those queries.

On the other hand, when you perform large scans, you could benefit from the best of both worlds: higher performance at lower cost. For instance, imagine that you wanted to enable your business analysts to interactively discover insights across a vast amount of historical data. In the example below, the pushdown aggregation query is modified to analyze seven years of data instead of three months:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, uv.totalRevenue
…
WHERE customer <= 3 and visitYearMonth >= 199201
… 
FROM dwdate WHERE d_yearmonthnum >= 199201) as t
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

This query requires scanning and aggregating nearly 1.9 billion records. As shown in the chart below, Redshift Spectrum substantially speeds up this query. A large Amazon Redshift cluster would have to be provisioned to support this use case. With the aid of Redshift Spectrum, you could use an existing small cluster, keep a single copy of your data in S3, and benefit from economical, durable storage while only paying for what you use via the pay per query pricing model.

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 7 years of data

Summary

Redshift Spectrum lowers the time to value for deeper insights on customer data queries spanning the data lake and data warehouse. It can enable interactive analysis on datasets in cases that weren’t economically practical or technically feasible before.

There are cases where you can get the best of both worlds from Redshift Spectrum: higher performance at lower cost. However, there are still latency-sensitive use cases where you may want native Amazon Redshift performance. For more best practice tips, see the 10 Best Practices for Amazon Redshift post.

Please visit the Amazon Redshift Spectrum PoC Environment Github page. If you have questions or suggestions, please comment below.

 


Additional Reading

Learn more about how Amazon Redshift Spectrum extends data warehousing out to exabytes – no loading required.


About the Author

Dylan Tong is an Enterprise Solutions Architect at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

 

 

Sean’s DIY Bitcoin Lottery with a Raspberry Pi

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/seans-diy-bitcoin-lottery/

After several explorations into the world of 3D printing, and fresh off the back of his $5 fidget spinner crowdfunding campaign, Sean Hodgins brings us his latest project: a DIY Bitcoin Lottery!

DIY Bitcoin Lottery with a Raspberry Pi

Build your own lottery! Thingiverse Files: https://www.thingiverse.com/thing:2494568 Pi How-to: http://www.idlehandsproject.com/raspberry-pi-bitcoin-lottery/ Instructables: https://www.instructables.com/id/DIY-Bitcoin-Lottery-With-Raspberry-Pi/ Send me bitcoins if you want!

What is Bitcoin mining?

According to the internet, Bitcoin mining is:

[A] record-keeping service. Miners keep the blockchain consistent, complete, and unalterable by repeatedly verifying and collecting newly broadcast transactions into a new group of transactions called a block. Each block contains a cryptographic hash of the previous block, using the SHA-256 hashing algorithm, which links it to the previous block, thus giving the blockchain its name.

If that makes no sense to you, welcome to the club. So here’s a handy video which explains it better.

What is Bitcoin Mining?

For more information: https://www.bitcoinmining.com and https://www.weusecoins.com What is Bitcoin Mining? Have you ever wondered how Bitcoin is generated? This short video is an animated introduction to Bitcoin Mining. Credits: Voice – Chris Rice (www.ricevoice.com) Motion Graphics – Fabian Rühle (www.fabianruehle.de) Music/Sound Design – Christian Barth (www.akkord-arbeiter.de) Andrew Mottl (www.andrewmottl.com)

Okay, now I get it.

I swear.

Sean’s Bitcoin Lottery

As a retired Bitcoin miner, Sean understands how the system works and what is required for mining. And since news sources report that Bitcoin is currently valued at around $4000, Sean decided to use a Raspberry Pi to bring to life an idea he’d been thinking about for a little while.

Sean Hodgins Raspberry Pi Bitcoin Lottery

He fitted the Raspberry Pi into a 3D-printed body, together with a small fan, a strip of NeoPixels, and a Block Eruptor ASIC, which is the dedicated mining hardware. The Pi runs a Python script compatible with CGMiner, a piece of mining software that needs far more explanation than I can offer in this short blog post.

The NeoPixels take the first six characters of the current block’s 64-character hex number and interpret them as a hex colour code. In this way, the block’s data is converted into colour, which, when you think about it, is kind of beautiful.
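
Sean’s actual script, linked at the end of the post, is in Python; as a rough sketch of the conversion itself (the hash value below is made up):

import java.awt.Color;

public class BlockColour {
    public static void main(String[] args) {
        // a made-up 64-character hex string standing in for the current block's number
        String blockHex = "3a7f52c19b04de8876aa01f3c2d94b7e5510cafe6d28e90b134577fa0c6d21b8";
        String firstSix = blockHex.substring(0, 6);               // "3a7f52"
        Color colour = new Color(Integer.parseInt(firstSix, 16)); // treat it as 0xRRGGBB
        System.out.printf("#%s -> R=%d, G=%d, B=%d%n",
                firstSix, colour.getRed(), colour.getGreen(), colour.getBlue());
    }
}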

The device moves on to trying to solve a new block every 20 minutes. When it does, the NeoPixel LEDs play a flashing ‘Win’ or ‘Lose’ animation to let you know whether you were the one to solve the previous block.

Sean Hodgins Raspberry Pi Bitcoin Lottery

Lottery results

Sean has done the maths to calculate the power consumption of the device. He says that the annual cost of running his Bitcoin Lottery is roughly what you would pay for two lottery scratch cards. Now, the odds of solving a block are much lower than those of buying a winning scratch card. However, since the mining device moves on to a new block every 20 minutes, the odds of being a winner with Bitcoin using Sean’s build are actually better than those of winning the lottery.

Sean Hodgins Raspberry Pi Bitcoin Lottery

MATHS!

But even if you don’t win, Sean’s project is a fun experiment in Bitcoin mining and creating colour through code. And if you want to make your own, you can download the 3D-files here, find the code here, and view the step-by-step guide here on Instructables.

Good luck and happy mining!

The post Sean’s DIY Bitcoin Lottery with a Raspberry Pi appeared first on Raspberry Pi.

Stubbing Key-Value Stores

Post Syndicated from Bozho original https://techblog.bozho.net/stubbing-key-value-stores/

Every project that has a database has a dilemma: how to test database-dependent code. There are several options (not mutually exclusive):

  • Use mocks – use only unit tests and mock the data-access layer, assuming the DAO-to-database communication works
  • Use an embedded database that each test starts and shuts down. This can also be viewed as unit-testing
  • Use a real database deployed somewhere (either locally or on a test environment). The hard part is making sure it’s always in a clean state.
  • Use end-to-end/functional tests/bdd/UI tests after deploying the application on a test server (which has a proper database).

None of the above is without problems. Unit tests with mocked DAOs can’t really test more complex interactions that rely on a database state. Embedded databases are not always available (e.g. if you are using a non-relational database, or if you rely on RDBMS-specific functionality, HSQLDB won’t do), or they can be slow to start, and thus your tests may take too long. A real database installation complicates setup, and keeping it clean is not always easy. The coverage of end-to-end tests can’t be easily measured and they don’t necessarily cover all the edge cases, as they are harder to maintain than unit and integration tests.

I’ve recently tried a strange approach that is working pretty well so far – stubbing the database. It is applicable more to key-value stores and less to relational databases.

In my case, even though there is an embedded Cassandra, it was slow to start, wasn’t easy to set up, and had subtle issues. That’s why I replaced the whole thing with an in-memory ConcurrentHashMap.

Since I’m using spring-data-cassandra, I just extended the CassandraTemplate class and implemented all the methods in the new StubCassandraTemplate, and used it instead of the regular one in the test Spring context. The stub can support all the key/value operations pretty easily, and you can have a bit more complicated integration tests (it’s not a good idea to have very complicated tests, of course, but unit tests can either be too simple or too reliant on a lot of mocks). Here’s an excerpt from the code:

@Component("cassandraTemplate")
public class StubCassandraTemplate extends CassandraTemplate {

    // in-memory "tables": entity class -> (primary key value -> entity)
    private Map<Class<?>, Map<Object, Object>> data = new ConcurrentHashMap<>();

    @Override
    public void afterPropertiesSet() {
        // no validation, no Cassandra session to initialize
    }

    @Override
    public <T> T insert(T entity) {
        // locate the @PrimaryKey field and use its value as the map key
        List<Field> pk = FieldUtils.getFieldsListWithAnnotation(entity.getClass(), PrimaryKey.class);
        initializeClass(entity.getClass());
        try {
            pk.get(0).setAccessible(true);
            data.get(entity.getClass()).put(pk.get(0).get(entity), entity);
            return entity; // return the inserted entity, as the real template does
        } catch (IllegalAccessException e) {
            throw new IllegalArgumentException(e);
        }
    }

    private void initializeClass(Class<?> clazz) {
        data.computeIfAbsent(clazz, c -> new ConcurrentHashMap<>());
    }
....
}

Cassandra supports some advanced features like CQL (query language), which isn’t as easy to stub as key-value operations like get and put, but in fact it is not that hard. Especially if you do not rely on complicated where clauses (and this is a bad practice in Cassandra anyway), it’s easy to parse the query with regex and find the appropriate entries in the ConcurrentHashMap.
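
To give an idea, here’s a sketch (not from the real project, imports omitted as in the excerpt above) of how the simplest SELECT * FROM table WHERE column='value' case could be handled inside the stub, reusing the data map and FieldUtils from above; anything fancier would need proper parsing:

private static final Pattern SIMPLE_SELECT = Pattern.compile(
        "SELECT \\* FROM (\\w+) WHERE (\\w+)\\s*=\\s*'([^']*)'", Pattern.CASE_INSENSITIVE);

@SuppressWarnings("unchecked")
public <T> List<T> selectSimple(String cql, Class<T> entityClass) {
    Matcher m = SIMPLE_SELECT.matcher(cql.trim());
    if (!m.matches()) {
        throw new UnsupportedOperationException("The stub only supports simple SELECTs: " + cql);
    }
    // group(1) is the table name; the stub keys data by entity class instead
    String column = m.group(2);
    String value = m.group(3);
    Map<Object, Object> table = data.getOrDefault(entityClass, Collections.emptyMap());
    List<T> result = new ArrayList<>();
    for (Object entity : table.values()) {
        try {
            // compare the string form of the field against the value from the query
            Object fieldValue = FieldUtils.readField(entity, column, true);
            if (value.equals(String.valueOf(fieldValue))) {
                result.add((T) entity);
            }
        } catch (IllegalAccessException e) {
            throw new IllegalArgumentException(e);
        }
    }
    return result;
}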

Key-value stores are a good candidate for this approach, as their main advantage – being easy to scale horizontally – is not needed in an integration test scenario. You simply need to verify that your code correctly handles interactions with the database in terms of what it puts there and what it gets back. The exact implementation of that interaction – whether it’s in-memory or using a binary protocol, may be viewed as out of scope.

Note that these tests do not guarantee that the application will work with a real database. They only guarantee that it will behave properly if the database behaves the same way as an in-memory key-value data structure. Which is normally the assumption, but isn’t always true – e.g. the database can impose additional constraints that your stub implementation doesn’t have. Cassandra, for example, doesn’t allow WHERE queries for non-indexed columns. If you don’t take that into account, obviously, your test will pass, but your application will break.

That’s why you’d still need end-to-end tests and possibly some real integration tests, but you can cover most of the code with a simple in-memory stub and only do some “sanity” full integration tests.

This doesn’t mean you should always stub your database, but it’s a good option in your testing toolbox to consider.

The post Stubbing Key-Value Stores appeared first on Bozho's tech blog.

Nazis, are bad

Post Syndicated from Eevee original https://eev.ee/blog/2017/08/13/nazis-are-bad/

Anonymous asks:

Could you talk about something related to the management/moderation and growth of online communities? IOW your thoughts on online community management, if any.

I think you’ve tweeted about this stuff in the past so I suspect you have thoughts on this, but if not, again, feel free to just blog about … anything 🙂

Oh, I think I have some stuff to say about community management, in light of recent events. None of it hasn’t already been said elsewhere, but I have to get this out.

Hopefully the content warning is implicit in the title.


I am frustrated.

I’ve gone on before about a particularly bothersome phenomenon that hurts a lot of small online communities: often, people are willing to tolerate the misery of others in a community, but then get up in arms when someone pushes back. Someone makes a lot of off-hand, off-color comments about women? Uses a lot of dog-whistle terms? Eh, they’re not bothering anyone, or at least not bothering me. Someone else gets tired of it and tells them to knock it off? Whoa there! Now we have the appearance of conflict, which is unacceptable, and people will turn on the person who’s pissed off — even though they’ve been at the butt end of an invisible conflict for who knows how long. The appearance of peace is paramount, even if it means a large chunk of the population is quietly miserable.

Okay, so now, imagine that on a vastly larger scale, and also those annoying people who know how to skirt the rules are Nazis.


The label “Nazi” gets thrown around a lot lately, probably far too easily. But when I see a group of people doing the Hitler salute, waving large Nazi flags, wearing Nazi armbands styled after the SS, well… if the shoe fits, right? I suppose they might have flown across the country to join a torch-bearing mob ironically, but if so, the joke is going way over my head. (Was the murder ironic, too?) Maybe they’re not Nazis in the sense that the original party doesn’t exist any more, but for ease of writing, let’s refer to “someone who espouses Nazi ideology and deliberately bears a number of Nazi symbols” as, well, “a Nazi”.

This isn’t a new thing, either; I’ve stumbled upon any number of Twitter accounts that are decorated in Nazi regalia. I suppose the trouble arises when perfectly innocent members of the alt-right get unfairly labelled as Nazis.

But hang on; this march was called “Unite the Right” and was intended to bring together various far right sub-groups. So what does their choice of aesthetic say about those sub-groups? I haven’t heard, say, alt-right coiner Richard Spencer denounce the use of Nazi symbology — extra notable since he was fucking there and apparently didn’t care to discourage it.


And so begins the rule-skirting. “Nazi” is definitely overused, but even using it to describe white supremacists who make not-so-subtle nods to Hitler is likely to earn you some sarcastic derailment. A Nazi? Oh, so is everyone you don’t like and who wants to establish a white ethno state a Nazi?

Calling someone a Nazi — or even a white supremacist — is an attack, you see. Merely expressing the desire that people of color not exist is perfectly peaceful, but identifying the sentiment for what it is causes visible discord, which is unacceptable.

These clowns even know this sort of thing and strategize around it. Or, try, at least. Maybe it wasn’t that successful this weekend — though flicking through Charlottesville headlines now, they seem to be relatively tame in how they refer to the ralliers.

I’m reminded of a group of furries — the alt-furries — who have been espousing white supremacy and wearing red armbands with a white circle containing a black… pawprint. Ah, yes, that’s completely different.


So, what to do about this?

“Ignore them” is a popular option, often espoused to bullied children by parents who have never been bullied, shortly before they resume complaining about passive-aggressive office politics. The trouble with ignoring them is that, just like in smaller communities, they have a tendency to fester. They take over large chunks of influential Internet surface area like 4chan and Reddit; they help get an inept buffoon elected; and then they start to have torch-bearing rallies and run people over with cars.

4chan illustrates a kind of corollary here. Anyone who’s steeped in Internet Culture™ is surely familiar with 4chan; I was never a regular visitor, but it had enough influence that I was still aware of it and some of its culture. It was always thick with irony, which grew into a sort of ironic detachment — perhaps one of the major sources of the recurring online trope that having feelings is bad — which proceeded into ironic racism.

And now the ironic racism is indistinguishable from actual racism, as tends to be the case. Do they “actually” “mean it”, or are they just trying to get a rise out of people? What the hell is unironic racism if not trying to get a rise out of people? What difference is there to onlookers, especially as they move to become increasingly involved with politics?

“It’s just a joke” and “it was just a thoughtless comment” are exceptionally common defenses made by people desperate to preserve the illusion of harmony, but the strain of overt white supremacy currently running rampant through the US was built on those excuses.


The other favored option is to debate them, to defeat their ideas with better ideas.

Well, hang on. What are their ideas, again? I hear they were chanting stuff like “go back to Africa” and “fuck you, faggots”. Given that this was an overtly political rally (and again, the Nazi fucking regalia), I don’t think it’s a far cry to describe their ideas as “let’s get rid of black people and queer folks”.

This is an underlying proposition: that white supremacy is inherently violent. After all, if the alt-right seized total political power, what would they do with it? If I asked the same question of Democrats or Republicans, I’d imagine answers like “universal health care” or “screw over poor people”. But people whose primary goal is to have a country full of only white folks? What are they going to do, politely ask everyone else to leave? They’re invoking the memory of people who committed genocide and also tried to take over the fucking world. They are outright saying, these are the people we look up to, this is who we think had a great idea.

How, precisely, does one defeat these ideas with rational debate?

Because the underlying core philosophy beneath all this is: “it would be good for me if everything were about me”. And that’s true! (Well, it probably wouldn’t work out how they imagine in practice, but it’s true enough.) Consider that slavery is probably fantastic if you’re the one with the slaves; the issue is that it’s reprehensible, not that the very notion contains some kind of 101-level logical fallacy. That’s probably why we had a fucking war over it instead of hashing it out over brunch.

…except we did hash it out over brunch once, and the result was that slavery was still allowed but slaves only counted as 60% of a person for the sake of counting how much political power states got. So that’s how rational debate worked out. I’m sure the slaves were thrilled with that progress.


That really only leaves pushing back, which raises the question of how to push back.

And, I don’t know. Pushing back is much harder in spaces you don’t control, spaces you’re already struggling to justify your own presence in. For most people, that’s most spaces. It’s made all the harder by that tendency to preserve illusory peace; even the tamest request that someone knock off some odious behavior can be met by pushback, even by third parties.

At the same time, I’m aware that white supremacists prey on disillusioned young white dudes who feel like they don’t fit in, who were promised the world and inherited kind of a mess. Does criticism drive them further away? The alt-right also opposes “political correctness”, i.e. “not being a fucking asshole”.

God knows we all suck at this kind of behavior correction, even within our own in-groups. Fandoms have become almost ridiculously vicious as platforms like Twitter and Tumblr amplify individual anger to deafening levels. It probably doesn’t help that we’re all just exhausted, that every new fuck-up feels like it bears the same weight as the last hundred combined.

This is the part where I admit I don’t know anything about people and don’t have any easy answers. Surprise!


The other alternative is, well, punching Nazis.

That meme kind of haunts me. It raises really fucking complicated questions about when violence is acceptable, in a culture that’s completely incapable of answering them.

America’s relationship to violence is so bizarre and two-faced as to be almost incomprehensible. We worship it. We have the biggest military in the world by an almost comical margin. It’s fairly mainstream to own deadly weapons for the express stated purpose of armed revolution against the government, should that become necessary, where “necessary” is left ominously undefined. Our movies are about explosions and beating up bad guys; our video games are about explosions and shooting bad guys. We fantasize about solving foreign policy problems by nuking someone — hell, our talking heads are currently in polite discussion about whether we should nuke North Korea and annihilate up to twenty-five million people, as punishment for daring to have the bomb that only we’re allowed to have.

But… violence is bad.

That’s about as far as the other side of the coin gets. It’s bad. We condemn it in the strongest possible terms. Also, guess who we bombed today?

I observe that the one time Nazis were a serious threat, America was happy to let them try to take over the world until their allies finally showed up on our back porch.

Maybe I don’t understand what “violence” means. In a quest to find out why people are talking about “leftist violence” lately, I found a National Review article from May that twice suggests blocking traffic is a form of violence. Anarchists have smashed some windows and set a couple fires at protests this year — and, hey, please knock that crap off? — which is called violence against, I guess, Starbucks. Black Lives Matter could be throwing a birthday party and Twitter would still be abuzz with people calling them thugs.

Meanwhile, there’s a trend of murderers with increasingly overt links to the alt-right, and everyone is still handling them with kid gloves. First it was murders by people repeating their talking points; now it’s the culmination of a torches-and-pitchforks mob. (Ah, sorry, not pitchforks; assault rifles.) And we still get this incredibly bizarre both-sides-ism, a White House that refers to the people who didn’t murder anyone as “just as violent if not more so”.


Should you punch Nazis? I don’t know. All I know is that I’m extremely dissatisfied with discourse that’s extremely alarmed by hypothetical punches — far more mundane than what you’d see after a sporting event — but treats a push for ethnic cleansing as a mere difference of opinion.

The equivalent to a punch in an online space is probably banning, which is almost laughable in comparison. It doesn’t cause physical harm, but it is a use of concrete force. Doesn’t pose quite the same moral quandary, though.

Somewhere in the middle is the currently popular pastime of doxxing (doxxxxxxing) people spotted at the rally in an attempt to get them fired or whatever. Frankly, that skeeves me out, though apparently not enough that I’m directly chastizing anyone for it.


We aren’t really equipped, as a society, to deal with memetic threats. We aren’t even equipped to determine what they are. We had a fucking world war over this, and now people are outright saying “hey I’m like those people we went and killed a lot in that world war” and we give them interviews and compliment their fashion sense.

A looming question is always, what if they then do it to you? What if people try to get you fired, to punch you for your beliefs?

I think about that a lot, and then I remember that it’s perfectly legal to fire someone for being gay in half the country. (Courts are currently wrangling whether Title VII forbids this, but with the current administration, I’m not optimistic.) I know people who’ve been fired for coming out as trans. I doubt I’d have to look very far to find someone who’s been punched for either reason.

And these aren’t even beliefs; they’re just properties of a person. You can stop being a white supremacist, one of those people yelling “fuck you, faggots”.

So I have to recuse myself from this asinine question, because I can’t fairly judge the risk of retaliation when it already happens to people I care about.

Meanwhile, if a white supremacist does get punched, I absolutely still want my tax dollars to pay for their universal healthcare.


The same wrinkle comes up with free speech, which is paramount.

The ACLU reminds us that the First Amendment “protects vile, hateful, and ignorant speech”. I think they’ve forgotten that that’s a side effect, not the goal. No one sat down and suggested that protecting vile speech was some kind of noble cause, yet that’s how we seem to be treating it.

The point was to avoid a situation where the government is arbitrarily deciding what qualifies as vile, hateful, and ignorant, and was using that power to eliminate ideas distasteful to politicians. You know, like, hypothetically, if they interrogated and jailed a bunch of people for supporting the wrong economic system. Or convicted someone under the Espionage Act for opposing the draft. (Hey, that’s where the “shouting fire in a crowded theater” line comes from.)

But these are ideas that are already in the government. Bannon, a man who was chair of a news organization he himself called “the platform for the alt-right”, has the President’s ear! How much more mainstream can you get?

So again I’m having a little trouble balancing “we need to defend the free speech of white supremacists or risk losing it for everyone” against “we fairly recently were ferreting out communists and the lingering public perception is that communists are scary, not that the government is”.


This isn’t to say that freedom of speech is bad, only that the way we talk about it has become fanatical to the point of absurdity. We love it so much that we turn around and try to apply it to corporations, to platforms, to communities, to interpersonal relationships.

Look at 4chan. It’s completely public and anonymous; you only get banned for putting the functioning of the site itself in jeopardy. Nothing is stopping a larger group of people from joining its politics board and tilting sentiment the other way — except that the current population is so odious that no one wants to be around them. Everyone else has evaporated away, as tends to happen.

Free speech is great for a government, to prevent quashing politics that threaten the status quo (except it’s a joke and they’ll do it anyway). People can’t very readily just bail when the government doesn’t like them, anyway. It’s also nice to keep in mind to some degree for ubiquitous platforms. But the smaller you go, the easier it is for people to evaporate away, and the faster pure free speech will turn the place to crap. You’ll be left only with people who care about nothing.


At the very least, it seems clear that the goal of white supremacists is some form of destabilization, of disruption to the fabric of a community for purely selfish purposes. And those are the kinds of people you want to get rid of as quickly as possible.

Usually this is hard, because they act just nicely enough to create some plausible deniability. But damn, if someone is outright telling you they love Hitler, maybe skip the principled hand-wringing and eject them.

How To Send Ethereum Transactions With Java

Post Syndicated from Bozho original https://techblog.bozho.net/send-ethereum-transactions-java/

After I’ve expressed my concerns about the blockchain technology, let’s get a bit more practical with the blockchain. In particular, with Ethereum.

I needed to send a transaction with Java, so I looked at EthereumJ. You have three options:

  • Full node – you enable syncing, which means the whole blockchain gets downloaded. It takes a lot of time, so I abandoned that approach
  • “Light” node – you disable syncing, so you just become part of the network, but don’t fetch any parts of the chain. Not entirely sure, but I think this corresponds to the “light” mode of geth (the ethereum CLI). You are able to send messages (e.g. transaction messages) to other peers to process and store on the blockchain, but you yourself do not have the blockchain.
  • Offline (no node) – just create and sign the transaction, compute its raw representation (in the ethereum RLP format) and push it to the blockchain via a centralized API, e.g. the etherscan.io API. Etherscan is itself a node on the network and it can perform all of the operations (so it serves as a proxy)

Before going further, maybe it’s worth pointing out a few general properties of the blockchain (the ethereum one and popular cryptocurrencies at least) – it is a distributed database, relying on a peer-to-peer (overlay) network, formed by whoever has a client software running (wallet or otherwise). Transactions are in the form of “I (private key owner) want to send this amount to that address”. Transactions can have additional data stored inside them, e.g. representing what they are about. Transactions then get verified by peers (currently using a Proof-of-work based consensus) and get stored on the blockchain, which means every connected peer gets the newly created blocks (each block consisting of multiple transactions). That’s the blockchain in short, and Ethereum is no exception.

Why would you want to send transactions? I can’t think of a simple and obvious use case – maybe you just want to implement a better wallet than the existing ones. For example, in my case I wanted to store the head of a hash chain on the blockchain so that it cannot be tampered with.

In my particular case I was more interested in storing a particular piece of data as part of the transaction, rather than the transaction itself, so I had two nodes that sent very small transactions to each other (randomly choosing sender and recipient). I know I could probably have done that with a smart contract instead, but “one step at a time”. The initial code can be found here, and is based heavily on the EthereumJ samples. Since EthereumJ uses spring internally, and my application uses spring, it took some extra effort to allow for two nodes, but that’s not so relevant to the task at hand. The most important piece of the code can be seen further below in this post, only slightly modified.

You should have a user.conf file on the classpath with some defaults, and it can be based on the default ethereumj config. The more important part is the external user1 and user2 conf files (which in the general scenario can just be one conf file). Here’s a sample one, with the following important parameters:

  • peer.networkId – whether you are using the real production network (=1), or a test network (=3). Obviously, for anything than production you’d want a test network. On test networks you can get free ether by utilizing a faucet. In order to use a test network there are two more parameters below – blockchain.config.name = ropsten and genesis = ropsten.json. Note that there are more test networks at the moment, for experimenting with alternatives to proof-of-work.
  • peer.privateKey – this is the most important bit. It is your secret key which gives you control over your blockchain “account”. Only with that private key can you sign transactions (using an elliptic curve algorithm). The private key has a corresponding public key, which is basically your address on the network – if anyone wants to send funds, he sends them to your public key. But only you can then send funds from your account, as nobody else owns the private key. Which means you have to protect it. In this case it’s in plaintext in a file, which may not be ideal if you operate with big amounts of ether. Consider using some key-management solution (as outlined here)
  • peer.ip.list – this is optional, but preferable – you need to have a list of peers to connect to in order to bootstrap your client and make it part of the network. The peers there are connected to other peers, and so on, and so forth, so in the end it’s a single interconnected network. Note that in combination with the port number, that requires some additional network configuration if you are using that on a server/cluster/stack – you’d have to open some ports and allow outgoing and incoming connections.
  • database.dir – this is the directory where the blockchain and the list of discovered peers will be stored. It uses leveldb, and what I found out is that ethereumj uses an outdated leveldb which didn’t work on my machine. So I excluded them and manually used newer versions
  • sync.enabled – whether you want to fetch the blockchain or not. Normally you don’t need to, as it takes a lot of time, but that way you are not a full node and don’t contribute to the network.

As I noted earlier, I didn’t need a full node, I just needed to send a transaction. The light node would do (the difference should be simply switching sync.enabled from true to false), but after initially successfully connecting to peers, I started getting weird exceptions I didn’t have time to go into, so I couldn’t join the network anymore (maybe because of the crappy wifi I’m currently using).

Fortunately, there is a completely “offline” approach – use an external API to publish your transactions. All you need is your private key and a library (EthereumJ in this case) to prepare your transaction. So you can forget everything you read in the previous paragraphs. What you need is just the RLP encoded transaction after you have signed it. E.g.:

// next nonce for the sender (see the note on nonces below)
byte[] nonce = ByteUtil.intToBytesNoLeadZeroes(getTransactionCount(senderAddress) + 1);
byte[] gasPrice = getGasPrice(); // current gas price, e.g. from the eth_gasPrice endpoint
Transaction tx = new Transaction(
    nonce,
    gasPrice,
    ByteUtil.longToBytesNoLeadZeroes(200000), // gas limit
    receiverAddress,
    ByteUtil.bigIntegerToBytes(BigInteger.valueOf(1)), // value, in wei (the smallest unit)
    data.getBytes(StandardCharsets.UTF_8), // arbitrary payload data
    CHAIN_ID);

tx.sign(ECKey.fromPrivate(senderPrivateKey)); // sign with the sender's private key

byte[] rawTx = tx.getEncoded(); // RLP-encoded, signed transaction

restTemplate.getForObject(etherscanUrl, String.class, "0x" + BaseEncoding.base16().encode(rawTx));

In this example, I use the Etherscan.io API (there’s also a test one for the Ropsten network). Note: it doesn’t seem to be documented, but you have to pass a User-Agent header that matches your application name. It also has a manual entry form to test your transactions (the link is for the Ropsten test network).

What are the parameters above?

  • nonce – this is a sequence number for transactions per user (=per private key). Each subsequent transaction should have a nonce that is the nonce of the previous one + 1. That way nobody can replay the same transaction and drain the funds of the sender (the transaction that gets signed contains the nonce, so you cannot take the same raw transaction representation and just resubmit it). How do you obtain the nonce? If you are connected to the Ethereum network, there’s ethereum.getRepository().getNonce(fromAddress). However, in a disconnected scenario, you’d need to obtain the current number of transactions for the sender and then increment it. This is done via the eth_getTransactionCount endpoint (a rough sketch of that call follows after this list). Note that it’s returned as hexadecimal, so you have to parse it, e.g. {"jsonrpc":"2.0","result":"0x1","id":73}
  • gas price and gas limit – these are used to cover the transaction costs (sending isn’t free). The gas limit (200000 in the example above) is the maximum amount of gas you are willing to spend on the transaction. You can read more here. You can obtain the current gas price by calling the “eth_gasPrice” API endpoint. It’s probably a good idea to fetch the gas price periodically and cache it for a short period, rather than fetching it for every transaction. If you are connected to the network, you can obtain the gas price automatically.
  • receiverAddress – a byte array representing the public key of the recipient
  • value – how much ether you want to send. The value is specified in wei, the smallest unit of ether (1 ETH = 10^18 wei).
  • data – any additional data that you want to put in the transaction.
  • chainId – this is again related to which network you are using. Production=1, Ropsten test network=3. If you are curious why you have to encode it in a transaction, you can read here.
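For the disconnected scenario, here is a rough sketch of fetching the transaction count over JSON-RPC and parsing the hexadecimal result. The class name, the endpoint URL and the naive regex extraction are illustrative assumptions, not code from the original setup:

import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.client.RestTemplate;

public class NonceFetcher {

    private final RestTemplate restTemplate = new RestTemplate();

    // jsonRpcUrl is a placeholder for whichever JSON-RPC endpoint you actually use
    public long getTransactionCount(String jsonRpcUrl, String senderAddressHex) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        String body = "{\"jsonrpc\":\"2.0\",\"method\":\"eth_getTransactionCount\","
                + "\"params\":[\"0x" + senderAddressHex + "\",\"latest\"],\"id\":1}";

        String response = restTemplate.postForObject(jsonRpcUrl,
                new HttpEntity<>(body, headers), String.class);

        // the response looks like {"jsonrpc":"2.0","result":"0x1","id":73};
        // extract the hex value and parse it (a JSON library would be cleaner)
        String hex = response.replaceAll(".*\"result\"\\s*:\\s*\"0x([0-9a-fA-F]+)\".*", "$1");
        return Long.parseLong(hex, 16);
    }
}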

After that you sign the raw representation of the transaction with your private key (the raw representation is RLP (Recursive Length Prefix)). And then you send it to the API (you’d need a key for that, which you can get at Etherscan and include it in the URL). It’s almost identical to what you would’ve done if you were connected. But now you are relying on a central party (Etherscan) instead of becoming part of the network.

It may look “easy”, and when you’ve already done it and grasped it, it sounds like a piece of cake, but there are too many details that nobody abstracts from you, so you have to have the full picture before even being able to push a single transaction. What a nonce is, what a chainId is, what a test network is, how to get test ether (the top google result for a ropsten faucet doesn’t work at the moment, so you have to figure that out as well), then figure out whether you want to sync the chain or not, to be part of the network or not, to resolve weird connectivity issues and network configuration. And that’s not even mentioning smart contracts. I’m not saying it’s bad, it’s just not simple enough and that’s a barrier to wider adoption. That probably applies to most of programming, though. Anyway, I hope the above examples can get people started more easily.

The post How To Send Ethereum Transactions With Java appeared first on Bozho's tech blog.

Hacking Slot Machines by Reverse-Engineering the Random Number Generators

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/08/hacking_slot_ma.html

Interesting story:

The venture is built on Alex’s talent for reverse engineering the algorithms — known as pseudorandom number generators, or PRNGs — that govern how slot machine games behave. Armed with this knowledge, he can predict when certain games are likeliest to spit out money — insight that he shares with a legion of field agents who do the organization’s grunt work.

These agents roam casinos from Poland to Macau to Peru in search of slots whose PRNGs have been deciphered by Alex. They use phones to record video of a vulnerable machine in action, then transmit the footage to an office in St. Petersburg. There, Alex and his assistants analyze the video to determine when the games’ odds will briefly tilt against the house. They then send timing data to a custom app on an agent’s phone; this data causes the phones to vibrate a split second before the agent should press the “Spin” button. By using these cues to beat slots in multiple casinos, a four-person team can earn more than $250,000 a week.

It’s an interesting article; I have no idea how much of it is true.

The sad part is that the slot-machine vulnerability is so easy to fix. Although the article says that “writing such algorithms requires tremendous mathematical skill,” it’s really only true that designing the algorithms requires that skill. Using any secure encryption algorithm or hash function as a PRNG is trivially easy. And there’s no reason why the system can’t be designed with a real RNG. There is some randomness in the system somewhere, and it can be added into the mix as well. The programmers can use a well-designed algorithm, like my own Fortuna, but even something less well-thought-out is likely to foil this attack.
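To illustrate how simple that construction can be, here is a minimal, hypothetical sketch of the “hash function in counter mode” idea – it is not Fortuna and not any real slot machine’s generator, just the general pattern: hash a secret seed together with an incrementing counter, and the output is unpredictable to anyone who doesn’t know the seed.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashCounterPrng {
    private final byte[] seed;
    private long counter = 0;

    public HashCounterPrng(String secretSeed) {
        this.seed = secretSeed.getBytes(StandardCharsets.UTF_8);
    }

    // each call hashes seed || counter; without the seed, the next block cannot be predicted
    public byte[] nextBlock() throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        sha256.update(seed);
        sha256.update(ByteBuffer.allocate(Long.BYTES).putLong(counter++).array());
        return sha256.digest();
    }
}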

Top 10 Most Obvious Hacks of All Time (v0.9)

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/top-10-most-obvious-hacks-of-all-time.html

For teaching hacking/cybersecurity, I thought I’d create a list of the most obvious hacks of all time. Not the best hacks, the most sophisticated hacks, or the hacks with the biggest impact, but the most obvious hacks — ones that even the least knowledgeable among us should be able to understand. Below I propose some hacks that fit this bill, though in no particular order.

The reason I’m writing this is that my niece wants me to teach her some hacking. I thought I’d start with the obvious stuff first.

Shared Passwords

If you use the same password for every website, and one of those websites gets hacked, then the hacker has your password for all your websites. The reason your Facebook account got hacked wasn’t because of anything Facebook did, but because you used the same email-address and password when creating an account on “beagleforums.com”, which got hacked last year.

I’ve heard people say “I’m secure, because I choose a complex password and use it everywhere”. No, this is the very worst thing you can do. Sure, you can use the same password on all the sites you don’t care much about, but for Facebook, your email account, and your bank, you should have a unique password, so that when other sites get hacked, your important sites are secure.

And yes, it’s okay to write down your passwords on paper.

Tools: HaveIBeenPwned.com

PIN encrypted PDFs

My accountant emails PDF statements encrypted with the last 4 digits of my Social Security Number. This is not encryption — a 4 digit number has only 10,000 combinations, and a hacker can guess all of them in seconds.
PIN numbers for ATM cards work because ATM machines are online, and the machine can reject your card after four guesses. PIN numbers don’t work for documents, because they are offline — the hacker has a copy of the document on their own machine, disconnected from the Internet, and can continue making bad guesses with no restrictions.
Passwords protecting documents must be long enough that even trillions upon trillions of guesses are not enough to find them.
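To make the scale concrete, here is a tiny, hypothetical sketch that simply enumerates every possible 4-digit PIN – the loop finishes essentially instantly, and in a real attack each guess would just be a decryption attempt against the offline document:

public class PinBruteForce {
    public static void main(String[] args) {
        String target = "7294"; // a made-up PIN "protecting" a document
        for (int pin = 0; pin <= 9999; pin++) {
            String guess = String.format("%04d", pin);
            if (guess.equals(target)) {
                System.out.println("Found the PIN after " + (pin + 1) + " guesses: " + guess);
                break;
            }
        }
    }
}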

Tools: Hashcat, John the Ripper

SQL and other injection

The lazy way of combining websites with databases is to combine user input with an SQL statement. This combines code with data, so the obvious consequence is that hackers can craft data to mess with the code.
No, this isn’t obvious to the general public, but it should be obvious to programmers. The moment you write code that adds unfiltered user-input to an SQL statement, the consequence should be obvious. Yet, “SQL injection” has remained one of the most effective hacks for the last 15 years because somehow programmers don’t understand the consequence.
CGI shell injection is a similar issue. Back in the early days, when “CGI scripts” were a thing, it was really important, but these days, not so much, so I just included it with SQL. The consequence of executing shell code should’ve been obvious, but weirdly, it wasn’t. The IT guy at the company I worked for back in the late 1990s came to me and asked “this guy says we have a vulnerability, is he full of shit?”, and I had to answer “no, he’s right — obviously so”.
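For a concrete (hypothetical) illustration of the SQL case in Java/JDBC – the first method splices user input straight into the SQL string, so input like ' OR '1'='1 becomes part of the code; the second binds the input as a parameter, so it can only ever be data:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class LoginDao {

    // vulnerable: the user-supplied username becomes part of the SQL code itself
    public ResultSet findUserUnsafe(Connection conn, String username) throws Exception {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery(
                "SELECT * FROM users WHERE username = '" + username + "'");
    }

    // safe: the input is bound as a parameter and treated purely as data
    public ResultSet findUserSafe(Connection conn, String username) throws Exception {
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT * FROM users WHERE username = ?");
        stmt.setString(1, username);
        return stmt.executeQuery();
    }
}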

XSS (“Cross Site Scripting”) [*] is another injection issue, but this time at somebody’s web browser rather than a server. It works because websites will echo back what is sent to them. For example, if you search for Cross Site Scripting with the URL https://www.google.com/search?q=cross+site+scripting, then you’ll get a page back from the server that contains that string. If the string is JavaScript code rather than text, then some servers (though not Google) send back the code in the page in a way that it’ll be executed. This is most often used to hack somebody’s account: you send them an email or tweet a link, and when they click on it, the JavaScript gives control of the account to the hacker.

Cross site injection issues like this should probably be their own category, but I’m including it here for now.

More: Wikipedia on SQL injection, Wikipedia on cross site scripting.
Tools: Burpsuite, SQLmap

Buffer overflows

In the C programming language, programmers first create a buffer, then read input into it. If the input is longer than the buffer, then it overflows. The extra bytes overwrite other parts of the program, letting the hacker run code.
Again, it’s not a thing the general public is expected to know about, but is instead something C programmers should be expected to understand. They should know that it’s up to them to check the length and stop reading input before it overflows the buffer, that there’s no language feature that takes care of this for them.
We are three decades after the first major buffer overflow exploits, so there is no excuse for C programmers not to understand this issue.

What makes buffer overflows particularly obvious is the way they are wrapped in exploits, like in Metasploit. While the bug itself is obviously a bug, actually exploiting it can take some very non-obvious skill. However, once that exploit is written, any trained monkey can press a button and run the exploit. That’s where we get the insult “script kiddie” from — referring to wannabe-hackers who never learn enough to write their own exploits, but who spend a lot of time running the exploit scripts written by better hackers than they are.

More: Wikipedia on buffer overflow, Wikipedia on script kiddie,  “Smashing The Stack For Fun And Profit” — Phrack (1996)
Tools: bash, Metasploit

SendMail DEBUG command (historical)

The first popular email server in the 1980s was called “SendMail”. It had a feature whereby if you sent a “DEBUG” command to it, it would execute any code following the command. The consequence of this was obvious — hackers could (and did) upload code to take control of the server. This was used in the Morris Worm of 1988. Most Internet machines of the day ran SendMail, so the worm spread fast, infecting most machines.
This bug was mostly ignored at the time. It was thought of as a theoretical problem, one that might only rarely be used to hack a system. Part of the motivation of the Morris Worm was to demonstrate the consequences of such problems — consequences that should’ve been obvious but were somehow rejected by everyone.

More: Wikipedia on Morris Worm

Email Attachments/Links

I’m conflicted whether I should add this or not, because here’s the deal: you are supposed to click on attachments and links within emails. That’s what they are there for. The difference between good and bad attachments/links is not obvious. Indeed, easy-to-use email systems make detecting the difference harder.
On the other hand, the consequences of bad attachments/links is obvious. That worms like ILOVEYOU spread so easily is because people trusted attachments coming from their friends, and ran them.
We have no solution to the problem of bad email attachments and links. Viruses and phishing are pervasive problems. Yet, we know why they exist.

Default and backdoor passwords

The Mirai botnet was caused by surveillance-cameras having default and backdoor passwords, and being exposed to the Internet without a firewall. The consequence should be obvious: people will discover the passwords and use them to take control of the bots.
Surveillance-cameras have the problem that they are usually exposed to the public, and can’t be reached without a ladder — often a really tall ladder. Therefore, you don’t want a button consumers can press to reset to factory defaults. You want a remote way to reset them. Therefore, they put backdoor passwords to do the reset. Such passwords are easy for hackers to reverse-engineer, and hence, take control of millions of cameras across the Internet.
The same reasoning applies to “default” passwords. Many users will not change the defaults, leaving a ton of devices hackers can hack.

Masscan and background radiation of the Internet

I’ve written a tool that can easily scan the entire Internet in a short period of time. It surprises people that this is possible, but it’s obvious from the numbers. Internet addresses are only 32-bits long, or roughly 4 billion combinations. A fast Internet link can easily handle 1 million packets-per-second, so the entire Internet can be scanned in 4000 seconds, little more than an hour. It’s basic math.
Because it’s so easy, many people do it. If you monitor your Internet link, you’ll see a steady trickle of packets coming in from all over the Internet, especially Russia and China, from hackers scanning the Internet for things they can hack.
People’s reaction to this scanning is weirdly emotional, taking it personally, such as:
  1. Why are they hacking me? What did I do to them?
  2. Great! They are hacking me! That must mean I’m important!
  3. Grrr! How dare they?! How can I hack them back for some retribution!?

I find this odd, because obviously such scanning isn’t personal – the hackers have no idea who you are.

Tools: masscan, firewalls

Packet-sniffing, sidejacking

If you connect to the Starbucks WiFi, a hacker nearby can easily eavesdrop on your network traffic, because it’s not encrypted. Windows even warns you about this, in case you weren’t sure.

At DefCon, they have a “Wall of Sheep”, where they show passwords from people who logged onto stuff using the insecure “DefCon-Open” network. They’re called “sheep” for not grasping the basic fact that unencrypted traffic is unencrypted.

To be fair, it’s actually non-obvious to many people. Even if the WiFi itself is not encrypted, SSL traffic is. They expect their services to be encrypted, without them having to worry about it. And in fact, most are, especially Google, Facebook, Twitter, Apple, and other major services that won’t allow you to log in anymore without encryption.

But many services (especially old ones) may not be encrypted. Unless users check and verify them carefully, they’ll happily expose passwords.

What’s interesting is that 10 years ago, most services used SSL only to encrypt the password exchange, but then used unencrypted connections after that, relying on “cookies” to track the login session. This allowed the cookies to be sniffed and stolen, allowing other people to share the login session. I used this on stage at BlackHat to connect to somebody’s GMail session. Google, and other major websites, fixed this soon after. But it should never have been a problem — because the sidejacking of cookies should have been obvious.

Tools: Wireshark, dsniff

Stuxnet LNK vulnerability

Again, this issue isn’t obvious to the public, but it should’ve been obvious to anybody who knew how Windows works.
When Windows loads a .dll, it first calls the function DllMain(). A Windows link file (.lnk) can load icons/graphics from the resources in a .dll file. It does this by loading the .dll file, thus calling DllMain. A hacker could therefore put a .lnk file on a USB drive pointing to a .dll file, and cause arbitrary code execution as soon as a user inserted the drive.
I say this is obvious because I did this, created .lnks that pointed to .dlls, but without hostile DllMain code. The consequence should’ve been obvious to me, but I totally missed the connection. We all missed the connection, for decades.

Social Engineering and Tech Support [* * *]

After posting this, many people have pointed out “social engineering”, especially of “tech support”. This probably should be up near #1 in terms of obviousness.

The classic example of social engineering is when you call tech support and tell them you’ve lost your password, and they reset it for you with minimum of questions proving who you are. For example, you set the volume on your computer really loud and play the sound of a crying baby in the background and appear to be a bit frazzled and incoherent, which explains why you aren’t answering the questions they are asking. They, understanding your predicament as a new parent, will go the extra mile in helping you, resetting “your” password.

One of the interesting consequences is how it affects domain names (DNS). It’s quite easy in many cases to call up the registrar and convince them to transfer a domain name. This has been used in lots of hacks. It’s really hard to defend against. If a registrar charges only $9/year for a domain name, then it really can’t afford to provide very good tech support — or very secure tech support — to prevent this sort of hack.

Social engineering is such a huge problem, and obvious problem, that it’s outside the scope of this document. Just google it to find example after example.

A related issue that perhaps deserves its own section is OSINT [*], or “open-source intelligence”, where you gather public information about a target. For example, on the day the bank manager is out on vacation (which you got from their Facebook post) you show up and claim to be a bank auditor, and are shown into their office where you grab their backup tapes. (We’ve actually done this).

More: Wikipedia on Social Engineering, Wikipedia on OSINT, “How I Won the Defcon Social Engineering CTF” — blogpost (2011), “Questioning 42: Where’s the Engineering in Social Engineering of Namespace Compromises” — BSidesLV talk (2016)

Blue-boxes (historical) [*]

Telephones historically used what we call “in-band signaling”. That’s why when you dial on an old phone, it makes sounds — those sounds are sent no differently than the way your voice is sent. Thus, it was possible to make tone generators to do things other than simply dial calls. Early hackers (in the 1970s) would make tone-generators called “blue-boxes” and “black-boxes” to make free long distance calls, for example.

These days, “signaling” and “voice” are digitized, then sent as separate channels or “bands”. This is called “out-of-band signaling”. You can’t trick the phone system by generating tones. When your iPhone makes sounds when you dial, it’s entirely for your benefit and has nothing to do with how it signals the cell tower to make a call.

Early hackers, like the founders of Apple, are famous for having started their careers making such “boxes” to trick the phone system. The problem was obvious back in the day, which is why, as the phone system moved from analog to digital, the problem was fixed.

More: Wikipedia on blue box, Wikipedia article on Steve Wozniak.

Thumb drives in parking lots [*]

A simple trick is to put a virus on a USB flash drive, and drop it in a parking lot. Somebody is bound to notice it, stick it in their computer, and open the file.

This can be extended with tricks. For example, you can put a file labeled “third-quarter-salaries.xlsx” on the drive that requires macros to be enabled in order to open. It’s irresistible to other employees who want to know what their peers are being paid, so they’ll bypass any warning prompts in order to see the data.

Another example is to go online and get custom USB sticks made printed with the logo of the target company, making them seem more trustworthy.

We also did a trick of taking an Adobe Flash game “Punch the Monkey” and replacing the monkey with the logo of a competitor of our target. They not only played the game (infecting themselves with our virus), but gave it to others inside the company to play, infecting others, including the CEO.

Thumb drives like this have been used in many incidents, such as Russians hacking military headquarters in Afghanistan. It’s really hard to defend against.

More: “Computer Virus Hits U.S. Military Base in Afghanistan” — USNews (2008), “The Return of the Worm That Ate The Pentagon” — Wired (2011), DoD Bans Flash Drives — Stripes (2008)

Googling [*]

Search engines like Google will index your website — your entire website. Frequently companies put things on their website without much protection because they are nearly impossible for users to find. But Google finds them, then indexes them, causing them to pop up with innocent searches.
There are books written on “Google hacking” explaining what search terms to look for, like “not for public release”, in order to find such documents.

More: Wikipedia entry on Google Hacking, “Google Hacking” book.

URL editing [*]

At the top of every browser is what’s called the “URL”. You can change it. Thus, if you see a URL that looks like this:

http://www.example.com/documents?id=138493

Then you can edit it to see the next document on the server:

http://www.example.com/documents?id=138494

The owner of the website may think they are secure, because nothing points to this document, so the Google search won’t find it. But that doesn’t stop a user from manually editing the URL.
An example of this is a big Fortune 500 company that posts the quarterly results to the website an hour before the official announcement. Simply editing the URL from previous financial announcements allows hackers to find the document, then buy/sell the stock as appropriate in order to make a lot of money.
Another example is the classic case of Andrew “Weev” Auernheimer who did this trick in order to download the account email addresses of early owners of the iPad, including movie stars and members of the Obama administration. It’s an interesting legal case because on one hand, techies consider this so obvious as to not be “hacking”. On the other hand, non-techies, especially judges and prosecutors, believe this to be obviously “hacking”.

DDoS, spoofing, and amplification [*]

For decades now, online gamers have figured out an easy way to win: just flood the opponent with Internet traffic, slowing their network connection. This is called a DoS, which stands for “Denial of Service”. DoSing game competitors is often a teenager’s first foray into hacking.
A variant of this is when you hack a bunch of other machines on the Internet, then command them to flood your target. (The hacked machines are often called a “botnet”, a network of robot computers). This is called DDoS, or “Distributed DoS”. At this point, it gets quite serious, as instead of competitive gamers, hackers can take down entire businesses. Extortion scams, where hackers DDoS a website and then demand payment to stop, are a common way hackers earn money.
Another form of DDoS is “amplification”. Sometimes when you send a packet to a machine on the Internet it’ll respond with a much larger response, either a very large packet or many packets. The hacker can then send a packet to many of these sites, “spoofing” or forging the IP address of the victim. This causes all those sites to then flood the victim with traffic. Thus, with a small amount of outbound traffic, the hacker can flood the inbound traffic of the victim.
This is one of those things that has worked for 20 years, because it’s so obvious teenagers can do it, yet there is no obvious solution. President Trump’s executive order on cyberspace specifically demanded that his government come up with a report on how to address this, but it’s unlikely that they’ll come up with any useful strategy.

More: Wikipedia on DDoS, Wikipedia on Spoofing

Conclusion

Tweet me (@ErrataRob) your obvious hacks, so I can add them to the list.

Concerns About The Blockchain Technology

Post Syndicated from Bozho original https://techblog.bozho.net/concerns-blockchain-technology/

The so-called (and marketing-branded) “blockchain technology” is promised to revolutionize every industry. Anything, they say, will become decentralized, free from middle men or government control. Services will thrive on various installments of the blockchain, and smart contracts will automatically enforce any logic that is related to the particular domain.

I don’t mind having another technological leap (after the internet), and given that I’m technically familiar with the blockchain, I may even be part of it. But I’m not convinced it will happen, and I’m not convinced it’s going to be the next internet.

If we strip the hype, the technology behind Bitcoin is indeed a technical masterpiece. It combines existing techniques (like hash chains and merkle trees) with a very good proof-of-work based consensus algorithm. And it creates a digital currency, which, on top of being worth billions now, is simply cool.

But will this technology be mass-adopted, and will mass adoption allow it to retain the technological benefits it has?

First, I’d like to nitpick a little bit – if anyone is speaking about “decentralized software” when referring to “the blockchain”, be suspicious. Bitcoin and other peer-to-peer overlay networks are in fact “distributed” (see the pictures here). “Decentralized” means having multiple providers, but doesn’t mean each user will be a full-featured node on the network. This nitpicking is actually part of another argument, but we’ll get to that.

If blockchain-based applications want to reach mass adoption, they have to be user-friendly. I know I’m being captain obvious here (and fortunately some of the people in the area have realized that), but with the current state of the technology, it’s impossible for end users to even get it, let alone use it.

My first serious concern is usability. To begin with, you need to download the whole blockchain on your machine. When I got my first bitcoin several years ago (when it was still 10 euro), the blockchain was kind of small and I didn’t notice that problem. Nowadays both the Bitcoin and Ethereum blockchains take ages to download. I still haven’t managed to download the ethereum one – after several bugs and reinstalls of the client, I’m still at 15%. And we are just at the beginning. A user just will not wait for days to download something in order to be able to start using a piece of technology.

I recently proposed downloading snapshots of the blockchain via bittorrent to be included in the Ethereum protocol itself. I know that snapshots of the Bitcoin blockchain have been distributed that way, but it has been a manual process. If a client can quickly download the huge file up to a recent point, and then only download the latest blocks in the traditional way, starting up may be easier. Of course, the whole chain would have to be verified, but maybe that can be a background process that doesn’t stop you from using whatever is built on top of the particular blockchain. (I’m not sure if that will be secure enough, and that, say, potential Sybil attacks on the bittorrent part won’t make it undesirable; it’s just an idea).

But even if such an approach works and is adopted, that would still mean that for every service you’d have to download a separate blockchain. Of course, projects like Ethereum may seem like the “one stop shop” for cool blockchain-based applications, but fragmentation is already happening – there are alt-coins bundled with various services like file storage, DNS, etc. That will not be workable for end-users. And it’s certainly not an option for mobile, which is the dominant client now. If instead of downloading the entire chain, something like consistent hashing is used to distribute the content in small portions among clients, it might be workable. But how will trust work in that case, I don’t know. Maybe it’s possible, maybe not.

And yes, I know that you don’t necessarily have to install a wallet/client in order to make use of a given blockchain – you can just have a cloud-based wallet. Which is fairly convenient, but that gets me to my nitpicking from a few paragraphs above and to my second concern – this effectively turns a distributed system into a decentralized one – a limited number of cloud providers hold most of the data (just as a limited number of miners hold most of the processing power). And then, even though the underlying technology allows for a distributed deployment, we’ll end up again with something simply decentralized or even de-facto centralized, if mergers and acquisitions lead us there (and they probably will). And in order to be able to access our wallets/accounts from multiple devices, we’d use a convenient cloud service where we’d log in with our username and password (because the private key is just too technical and hard for regular users). And that seems to defeat the whole idea.

Not only that, but there is an inevitable centralization of decisions (who decides on the size of the block, who has commit rights to the client repository) as well as a hidden centralization of power – how much GPU power do the Chinese mining “farms” control, and can they influence the network significantly? And will the average user ever know that or care (as they don’t care that Google is centralized)? I think that overall, distributed technologies will follow the power law, and the majority of data/processing power/decision power will be controlled by a minority of actors. And so our distributed utopia will not happen in the pure form we dream of.

My third concern is incentive. Distributed technologies that have been successful so far have a pretty narrow set of incentives. The internet was promoted by large public institutions, including government agencies and big universities. Bittorrent was successful mainly because it allowed free movies and songs with 2 clicks of the mouse. And Bitcoin was successful because it offered financial benefits. I’m oversimplifying of course, but “government effort”, “free & easy” and “source of more money” seem to have been the successful incentives. On the other side of the fence there are dozens of failed distributed technologies. I’ve tried many of them – alternative search engines, alternative file storage, alternative ride-sharing, alternative social networks, even alternative “internets”. None have gained traction. Because they are not easier to use than their free competitors and you can’t make money out of them (and no government bothers promoting them).

Will blockchain-based services have sufficient incentives to drive customers? Will centralized competitors just easily crush the distributed alternatives by being cheaper, more user-friendly, and having sales departments that can target more than the hardcore geeks who have no problem syncing their blockchain via the command line? The utopian slogans seem very cool to idealists and futurists, but they don’t sell. “Free from centralized control, full control over your data” – we’d have to go through a long process of cultural change before these things make sense to more than a handful of people.

Speaking of services, often examples include “the sharing economy”, where one stranger offers a service to another stranger. Blockchain technology seems like a good fit here indeed – the services are by nature distributed, why should the technology be centralized? Here comes my fourth concern – identity. While for the cryptocurrencies it’s actually beneficial to be anonymous, for most of the real-world services (i.e. the industries that ought to be revolutionized) this is not an option. You can’t just go in the car of publicKey=5389BC989A342…. “But there are already distributed reputation systems”, you may say. Yes, and they are based on technical, not real-world identities. That doesn’t build trust. I don’t trust that publicKey=5389BC989A342… is the same person that got the high reputation. There may be five people behind that private key. The private key may have been stolen (e.g. in a cloud-provider breach).

The value of companies like Uber and AirBNB is that they serve as trust brokers. They verify and vouch for their drivers and hosts (and passengers and guests). They verify their identity through government-issued documents, skype calls, selfies, compare pictures to documents, get access to government databases, credit records, etc. Can a fully distributed service do that? No. You’d need a centralized provider to do it. And how would the blockchain make any difference then? Well, I may not be entirely correct here. I’ve actually been thinking quite a lot about decentralized identity. E.g. a way to predictably generate a private key based on, say, biometrics+password+government-issued-documents, and use the corresponding public key as your identifier, which is then fed into reputation schemes and ultimately – real-world services. But we’re not there yet.

And that is part of my fifth concern – the technology itself. We are not there yet. There are bugs, there are thefts and leaks. There are hard-forks. There isn’t sufficient understanding of the technology (I confess I don’t fully grasp all the implementation details, and they are always the key). Often the technology is advertised as “just working”, but it isn’t. The other day I read an article (lost the link) that clarifies a common misconception about smart contracts – they cannot interact with the outside world – they can’t call APIs (e.g. stock market prices, bank APIs), they can’t push or fetch data from anywhere but the blockchain. That mandates the need, again, for a centralized service that pushes the relevant information before smart contracts can pick it up. I’m pretty sure that all cool-sounding applications are not possible without extensive research. And even if/when they are, writing distributed code is hard. Debugging a smart contract is hard. Yes, hard is cool, but that doesn’t drive economic value.

I have mostly been referring to public blockchains so far. Private blockchains may have their practical application, but there’s one catch – they are not exactly the cool distributed technology that the Bitcoin uses. They may be called “blockchains” because they…chain blocks, but they usually centralize trust. For example the Hyperledger project uses PKI, with all its benefits and risks. In these cases, a centralized authority issues the identity “tokens”, and then nodes communicate and form a shared ledger. That’s a bit easier problem to solve, and the nodes would usually be on actual servers in real datacenters, and not on your uncle’s Windows XP.

That said, hash chaining has been around for quite a long time. I did research on the matter because of a side-project of mine, and it seems providing a tamper-proof/tamper-evident log/database on semi-trusted machines has been discussed in many computer science papers since the 90s. That alone is not “the magic blockchain” that will solve all of our problems, no matter what gossip protocols you sprinkle on top. I’m not saying that’s bad, on the contrary – any variation and combination of the building blocks of the blockchain (the hash chain, the consensus algorithm, the proof-of-work (or stake), possibly smart contracts) has potential for making useful products.
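For reference, the basic hash-chain building block really is tiny – here is a rough, hypothetical sketch of a tamper-evident log, where each entry’s hash covers the previous hash, so modifying any earlier entry breaks every later link:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class HashChainedLog {
    private final List<String> entries = new ArrayList<>();
    private byte[] previousHash = new byte[32]; // "genesis" value: all zeroes

    public byte[] append(String entry) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        sha256.update(previousHash);                            // link to the previous entry
        sha256.update(entry.getBytes(StandardCharsets.UTF_8));  // cover the new data
        previousHash = sha256.digest();
        entries.add(entry);
        return previousHash; // changing any earlier entry changes every hash after it
    }
}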

I know I sound like a naysayer here, but I hope I’ve pointed out particular issues, rather than aimlessly ranting at the hype (though that’s tempting as well). I’m confident that blockchain-like technologies will have their practical applications, and we will see some successful, widely-adopted services and solutions based on them, just as pointed out in this detailed report. But I’m not convinced it will be revolutionizing.

I hope I’m proven wrong, though, because watching a revolutionizing technology closely and even being part of it would be quite cool.

The post Concerns About The Blockchain Technology appeared first on Bozho's tech blog.

CyberChef – Cyber Swiss Army Knife

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/SOhld_nebGs/

CyberChef is a simple, intuitive web app for carrying out all manner of “cyber” operations within a web browser. These operations include simple encoding like XOR or Base64, more complex encryption like AES, DES and Blowfish, creating binary and hexdumps, compression and decompression of data, calculating hashes and checksums, IPv6 and X.509…

Read the full post at darknet.org.uk

Avoiding TPM PCR fragility using Secure Boot

Post Syndicated from Matthew Garrett original http://mjg59.dreamwidth.org/48897.html

In measured boot, each component of the boot process is “measured” (ie, hashed, and that hash recorded) in a register in the Trusted Platform Module (TPM) built into the system. The TPM has several different registers (Platform Configuration Registers, or PCRs) which are typically used for different purposes – for instance, PCR0 contains measurements of various system firmware components, PCR2 contains any option ROMs, PCR4 contains information about the partition table and the bootloader. The allocation of these is defined by the PC Client working group of the Trusted Computing Group. However, once the boot loader takes over, we’re outside the spec[1].

One important thing to note here is that the TPM doesn’t actually have any ability to directly interfere with the boot process. If you try to boot modified code on a system, the TPM will contain different measurements but boot will still succeed. What the TPM can do is refuse to hand over secrets unless the measurements are correct. This allows for configurations where your disk encryption key can be stored in the TPM and then handed over automatically if the measurements are unaltered. If anybody interferes with your boot process then the measurements will be different, the TPM will refuse to hand over the key, your disk will remain encrypted and whoever’s trying to compromise your machine will be sad.

The problem here is that a lot of things can affect the measurements. Upgrading your bootloader or kernel will do so. At that point if you reboot your disk fails to unlock and you become unhappy. To get around this your update system needs to notice that a new component is about to be installed, generate the new expected hashes and re-seal the secret to the TPM using the new hashes. If there are several different points in the update where this can happen, this can quite easily go wrong. And if it goes wrong, you’re back to being unhappy.

Is there a way to improve this? Surprisingly, the answer is “yes” and the people to thank are Microsoft. Appendix A of a basically entirely unrelated spec defines a mechanism for storing the UEFI Secure Boot policy and used keys in PCR 7 of the TPM. The idea here is that you trust your OS vendor (since otherwise they could just backdoor your system anyway), so anything signed by your OS vendor is acceptable. If someone tries to boot something signed by a different vendor then PCR 7 will be different. If someone disables secure boot, PCR 7 will be different. If you upgrade your bootloader or kernel, PCR 7 will be the same. This simplifies things significantly.

I’ve put together a (not well-tested) patchset for Shim that adds support for including Shim’s measurements in PCR 7. In conjunction with appropriate firmware, it should then be straightforward to seal secrets to PCR 7 and not worry about things breaking over system updates. This makes tying things like disk encryption keys to the TPM much more reasonable.

However, there’s still one pretty major problem, which is that the initramfs (ie, the component responsible for setting up the disk encryption in the first place) isn’t signed and isn’t included in PCR 7[2]. An attacker can simply modify it to stash any TPM-backed secrets or mount the encrypted filesystem and then drop to a root prompt. This, uh, reduces the utility of the entire exercise.

The simplest solution to this that I’ve come up with depends on how Linux implements initramfs files. In its simplest form, an initramfs is just a cpio archive. In its slightly more complicated form, it’s a compressed cpio archive. And in its peak form of evolution, it’s a series of compressed cpio archives concatenated together. As the kernel reads each one in turn, it extracts it over the previous ones. That means that any files in the final archive will overwrite files of the same name in previous archives.

My proposal is to generate a small initramfs whose sole job is to get secrets from the TPM and stash them in the kernel keyring, and then measure an additional value into PCR 7 in order to ensure that the secrets can’t be obtained again. Later disk encryption setup will then be able to set up dm-crypt using the secret already stored within the kernel. This small initramfs will be built into the signed kernel image, and the bootloader will be responsible for appending it to the end of any user-provided initramfs. This means that the TPM will only grant access to the secrets while trustworthy code is running – once the secret is in the kernel it will only be available for in-kernel use, and once PCR 7 has been modified the TPM won’t give it to anyone else. A similar approach for some kernel command-line arguments (the kernel, module-init-tools and systemd all interpret the kernel command line left-to-right, with later arguments overriding earlier ones) would make it possible to ensure that certain kernel configuration options (such as the iommu) weren’t overridable by an attacker.

There’s obviously a few things that have to be done here (standardise how to embed such an initramfs in the kernel image, ensure that luks knows how to use the kernel keyring, teach all relevant bootloaders how to handle these images), but overall this should make it practical to use PCR 7 as a mechanism for supporting TPM-backed disk encryption secrets on Linux without introducing a huge support burden in the process.

[1] The patchset I’ve posted to add measured boot support to Grub uses PCRs 8 and 9 to measure various components during the boot process, but other bootloaders may have different policies.

[2] This is because most Linux systems generate the initramfs locally rather than shipping it pre-built. It may also get rebuilt on various userspace updates, even if the kernel hasn’t changed. Including it in PCR 7 would entirely break the fragility guarantees and defeat the point of all of this.


Basic API Rate-Limiting

Post Syndicated from Bozho original https://techblog.bozho.net/basic-api-rate-limiting/

It is likely that you are developing some form of (web/RESTful) API, and in case it is publicly-facing (or even when it’s internal), you normally want to rate-limit it somehow. That is, to limit the number of requests performed over a period of time, in order to save resources and protect from abuse.

This can probably be achieved on the web-server/load-balancer level with some clever configuration, but usually you want the rate limiter to be client-specific (i.e. each client of your API should have a separate rate limit), and the way the client is identified varies. It’s probably still possible to do it on the load balancer, but I think it makes sense to have it on the application level.

I’ll use spring-mvc for the example, but any web framework has a good way to plug an interceptor.

So here’s an example of a spring-mvc interceptor:

@Component
public class RateLimitingInterceptor extends HandlerInterceptorAdapter {

    private static final Logger logger = LoggerFactory.getLogger(RateLimitingInterceptor.class);

    @Value("${rate.limit.enabled}")
    private boolean enabled;

    @Value("${rate.limit.hourly.limit}")
    private int hourlyLimit;

    // one rate limiter per API client, created lazily
    private Map<String, SimpleRateLimiter> limiters = new ConcurrentHashMap<>();

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler)
            throws Exception {
        if (!enabled) {
            return true;
        }
        String clientId = request.getHeader("Client-Id");
        // let non-API requests pass
        if (clientId == null) {
            return true;
        }
        SimpleRateLimiter rateLimiter = getRateLimiter(clientId);
        boolean allowRequest = rateLimiter.tryAcquire();

        if (!allowRequest) {
            response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
        }
        response.addHeader("X-RateLimit-Limit", String.valueOf(hourlyLimit));
        return allowRequest;
    }

    private SimpleRateLimiter getRateLimiter(String clientId) {
        return limiters.computeIfAbsent(clientId, id -> createRateLimiter(id));
    }

    private SimpleRateLimiter createRateLimiter(String clientId) {
        logger.info("Creating rate limiter for client {}", clientId);
        return SimpleRateLimiter.create(hourlyLimit, TimeUnit.HOURS);
    }

    @PreDestroy
    public void destroy() {
        // stop the replenishment schedulers of all limiters
        limiters.values().forEach(SimpleRateLimiter::stop);
    }
}

This initializes rate-limiters per client on demand. Alternatively, on startup you could just loop through all registered API clients and create a rate limiter for each. If the rate limiter doesn’t allow more requests (tryAcquire() returns false), return “Too many requests” and abort the execution of the request (return false from the interceptor).

This sounds simple. But there are a few catches. You may wonder where the SimpleRateLimiter above is defined. We’ll get there, but first let’s see what options we have for rate limiter implementations.

The most recommended one seems to be the guava RateLimiter. It has a straightforward factory method that gives you a rate limiter for a specified rate (permits per second). However, it doesn’t accommodate web APIs very well, as you can’t initialize the RateLimiter with a pre-existing number of permits. That means a period of time should elapse before the limiter allows requests. There’s another issue – if you have fewer than one permit per second (e.g. if your desired rate limit is “200 requests per hour”), you can pass a fraction (hourlyLimit / secondsInHour), but it still won’t work the way you expect it to, as internally there’s a “maxPermits” field that caps the number of permits to much less than you want. Also, the rate limiter doesn’t allow bursts – you have exactly X permits per second, but you cannot spread them over a longer period of time, e.g. have 5 requests in one second, and then no requests for the next few seconds. In fact, all of the above can be solved, but sadly, through hidden fields that you don’t have access to. Multiple feature requests have existed for years now, but Guava just doesn’t update the rate limiter, making it much less applicable to API rate-limiting.
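For reference, the factory method mentioned above looks roughly like this (a sketch of the fractional-rate approach, not a recommendation given the caveats):

import com.google.common.util.concurrent.RateLimiter;

public class GuavaLimiterSketch {
    public static void main(String[] args) {
        // "200 requests per hour" expressed as a fractional permits-per-second rate
        RateLimiter limiter = RateLimiter.create(200.0 / 3600);

        // non-blocking check – true if a permit is currently available
        System.out.println("allowed: " + limiter.tryAcquire());
        System.out.println("allowed: " + limiter.tryAcquire());
    }
}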

Using reflection, you can tweak the parameters and make the limiter work. However, it’s ugly, and it’s not guaranteed it will work as expected. I have shown here how to initialize a guava rate limiter with X permits per hour, with burstability and full initial permits. When I thought that would do, I saw that tryAcquire() has a synchronized(..) block. Will that mean all requests will wait for each other when simply checking whether they are allowed to make a request? That would be horrible.

So in fact the guava RateLimiter is not meant for (web) API rate-limiting. Maybe keeping it feature-poor is Guava’s way for discouraging people from misusing it?

That’s why I decided to implement something simple myself, based on a Java Semaphore. Here’s the naive implementation:

public class SimpleRateLimiter {
    private final Semaphore semaphore;
    private final int maxPermits;
    private final TimeUnit timePeriod;
    private ScheduledExecutorService scheduler;

    public static SimpleRateLimiter create(int permits, TimeUnit timePeriod) {
        SimpleRateLimiter limiter = new SimpleRateLimiter(permits, timePeriod);
        limiter.schedulePermitReplenishment();
        return limiter;
    }

    private SimpleRateLimiter(int permits, TimeUnit timePeriod) {
        this.semaphore = new Semaphore(permits);
        this.maxPermits = permits;
        this.timePeriod = timePeriod;
    }

    // non-blocking check: true if a permit was available and has been consumed
    public boolean tryAcquire() {
        return semaphore.tryAcquire();
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    // every "1 timePeriod" top the semaphore back up to maxPermits
    public void schedulePermitReplenishment() {
        scheduler = Executors.newScheduledThreadPool(1);
        scheduler.scheduleAtFixedRate(() -> {
            semaphore.release(maxPermits - semaphore.availablePermits());
        }, 1, 1, timePeriod);
    }
}

It takes a number of permits (the allowed number of requests) and a time period. The time period is “1 X”, where X can be second/minute/hour/day – depending on how you want your limit to be configured – per second, per minute, hourly, daily. Every 1 X a scheduler replenishes the acquired permits (in the example above there’s one scheduler per client, which may be inefficient with a large number of clients – you can pass a shared scheduler pool instead). There is no control for bursts (a client can spend all permits with a rapid succession of requests), there is no warm-up functionality, there is no gradual replenishment. Depending on what you want, this may not be ideal, but it’s just a basic rate limiter that is thread-safe and doesn’t have any blocking. I wrote a unit test to confirm that the limiter behaves properly, and also ran performance tests against a local application to make sure the limit is obeyed. So far it seems to be working.
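Here’s a rough usage sketch (the numbers are made up for illustration):

import java.util.concurrent.TimeUnit;

public class SimpleRateLimiterDemo {
    public static void main(String[] args) {
        // allow 5 requests per minute for a hypothetical client
        SimpleRateLimiter limiter = SimpleRateLimiter.create(5, TimeUnit.MINUTES);

        for (int i = 1; i <= 6; i++) {
            System.out.println("Request " + i + " allowed: " + limiter.tryAcquire());
        }
        // requests 1-5 print true; request 6 prints false until the permits are replenished

        limiter.stop(); // shut down the replenishment scheduler
    }
}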

Are there alternatives? Well, yes – there are libraries like RateLimitJ that use Redis to implement rate-limiting. That would mean, however, that you need to set up and run Redis. Which seems like an overhead for “simply” having rate-limiting. (Note: it seems to also have an in-memory version)

On the other hand, how would rate-limiting work properly in a cluster of application nodes? Application nodes probably need some database or gossip protocol to share data about the per-client permits (requests) remaining? Not necessarily. A very simple approach to this issue would be to assume that the load balancer distributes the load equally among your nodes. That way you would just have to set the limit on each node to be equal to the total limit divided by the number of nodes. It won’t be exact, but you rarely need it to be – allowing 5-10 more requests won’t kill your application, allowing 5-10 less won’t be dramatic for the users.

That, however, would mean that you have to know the number of application nodes. If you employ auto-scaling (e.g. in AWS), the number of nodes may change depending on the load. If that is the case, instead of configuring a hard-coded number of permits, the replenishing scheduled job can calculate the “maxPermits” on the fly, by calling an AWS (or other cloud-provider) API to obtain the number of nodes in the current auto-scaling group. That would still be simpler than supporting a redis deployment just for that.

Overall, I’m surprised there isn’t a “canonical” way to implement rate-limiting (in Java). Maybe the need for rate-limiting is not as common as it may seem. Or it’s implemented manually – by temporarily banning API clients that use “too many resources”.

Update: someone pointed out the bucket4j project, which seems nice and worth taking a look at.

The post Basic API Rate-Limiting appeared first on Bozho's tech blog.

State Dept, MPAA, RIAA “Fake Twitter Feud” Plan Backfires

Post Syndicated from Andy original https://torrentfreak.com/state-dept-mpaa-riaa-fake-twitter-feud-plan-backfires-170706/

By the first quarter of 2017, Twitter had 328 million users. It’s the perfect platform to give anyone a voice online and when like-minded people act together to make something “trend”, stories and ideas can go viral.

When this happens organically, through sharing based on a genuine appreciation of topics and ideas, it can be an awe-inspiring thing to behold. However, the mechanism doesn’t have to be spontaneous to reach a large audience, if it’s organized properly.

That was the plan of the US State Department when it sent an email to Stanford Law School. With the Office of Intellectual Property Enforcement involved, the State Department’s Bureau of Economic Affairs asked the law school to participate in a “fake Twitter feud” to promote Intellectual Property protection.

Leaked by a Stanford law professor to Mike Masnick at Techdirt, the email outlines the aims of the looming online war.

“This summer, we want to activate an audience of young professionals – the kind of folks who are interested in foreign policy, but who aren’t aware that intellectual property protection touches every part of their lives. I think the law school students at your institution may be the type of community that we would like to engage,” the email reads.

“The Bureau of Economic and Business Affairs wants to start a fake Twitter feud. For this feud, we would like to invite you and other similar academic institutions to participate and throw in your own ideas!” the email reads.

The plan clearly has some momentum. According to the email, big names in IP protection are already on board, including the US Patent and Trademark Office, the powerful Copyright Alliance, not to mention the Motion Picture Association of America and the Recording Industry Association of America.

The above groups can call on thousands of individuals to get involved so participation could be significant. Helpfully, the email also suggests how the ‘conflict’ should play out, suggesting various topics and important figures to fire up the debate.

“The week after the 4th of July, when everyone gets back from vacation but will still feel patriotic and summery, we want to tweet an audacious statement like, ‘Bet you couldn’t see the Independence Day fireworks without bifocals; first American diplomat Ben Franklin invented them #bestIPmoment @StateDept’,” the email reads.

As one of the Founding Fathers of the United States, Benjamin Franklin is indeed one of the most important figures in US history. And, as the inventor of not only bifocals, the lightning rod, and myriad other useful devices, his contribution to science and society is unquestionable.

Attaching him to this campaign, however, is a huge faux pas.

Despite inventing swim fins, the Franklin stove, the flexible catheter, a 24-hour three-wheel clock, a long-arm device to reach books from a high shelf, and becoming the first person to use the words “positive” and “negative” to describe electricity, Franklin refused to patent any of his inventions.

“As we benefit from the inventions of others, we should be glad to share our own…freely and gladly,” he wrote in his autobiography.

It’s abundantly clear that using Franklin as the seed for an IP protection campaign is problematic, to say the least. His inventions have enriched the lives of millions due to his kindness and desire to share.

Who knows what might have happened if patents for bifocals and lightning rods had been aggressively enforced. Certainly, the groups already committed to this campaign wouldn’t have given up such valuable Intellectual Property so easily.

To be fair to the Bureau of Economic and Business Affairs, the decision to use the term “fake Twitter feud” seems more misguided than malicious and it seems unlikely that any conflict could have broken out when all participants are saying the same thing.

That being said, with the Copyright Alliance, MPAA and RIAA on board, the complexion changes somewhat. All three have an extremely tough stance on IP enforcement so will have a key interest in influencing how the “feud” develops and who gets sucked in.

The big question now, however, is if this campaign will now go ahead as laid out in the email. The suggested hashtags (#MostAmericanIP and #BestIPMoment) have little traction so far and now everyone will know that far from being a spontaneous event, the whole thing will have been coordinated. That probably isn’t the best look.


Get social: connecting with Raspberry Pi

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/connecting-raspberry-pi-social/

Fancy connecting with Raspberry Pi beyond the four imaginary walls of this blog post? Want to find ways into the conversation among our community of makers, learners, and educators? Here’s how:

Twitter

Connecting with us on Twitter is your sure-fire way of receiving the latest news and articles from and about the Raspberry Pi Foundation, Code Club, and CoderDojo. Here you’ll experience the fun, often GIF-fuelled banter of the busy Raspberry Pi community, along with tips, project support, and event updates. This is the best place to follow hashtags such as #Picademy, #MakeYourIdeas, and #RJam in real time.

Raspberry Pi on Twitter

News! Raspberry Pi and @CoderDojo join forces in a merger that will help more young people get creative with tech: https://t.co/37y45ht7li

YouTube

We create a variety of video content, from Pi Towers fun, to resource videos, to interviews and program updates. We’re constantly adding content to our channel to bring you more interesting, enjoyable videos to watch and share within the community. Want to see what happens when you drill a hole through a Raspberry Pi Zero to make a fidget spinner? Or what Code Club International volunteers got up to when we brought them together in London for a catch-up? Maybe you’d like to try a new skill and need guidance? Our YouTube channel is the place to go!

Getting started with soldering

Learn the basics of how to solder components together, and the safety precautions you need to take. Find a transcript of this video in our accompanying learning resource: raspberrypi.org/learning/getting-started-with-soldering/

Instagram

Instagram is known as the home of gorgeous projects and even better-looking project photographs. Our Instagram, however, is mainly a collection of random office shenanigans, short video clips, and the occasional behind-the-scenes snap of projects, events, or the mess on my desk. Come join the party!

When one #AstroPi unit is simply not enough… Would you like to #3DPrint your own Astro Pi unit? Head to rpf.io/astroprint for the free files and assembly guide. #RaspberryPi #Space #ESA @astro_timpeake @thom_astro


Facebook

Looking to share information on Raspberry Pi with your social community? Maybe catch a live stream or read back through comments on some of our community projects? Then you’ll want to check out the Raspberry Pi Facebook page. It brings the world together via a vast collection of interesting articles, images, videos, and discussions. Here you’ll find information on upcoming events we’re visiting, links to our other social media accounts, and projects our community shares via visitor posts. If you have a moment to spare, you may even find you can answer a community question.

Raspberry Pi at the Scottish Learning Festival


Raspberry Pi forum

The Raspberry Pi forum is the go-to site for posting questions, getting support, and just having a good old chin wag. Whether you have problems setting up your Pi, need advice on how to build a media centre, or can’t figure out how to utilise Scratch for the classroom, the forum has you covered. Head there for absolutely anything Pi-related, and you’re sure to find help with your query – or better yet, the answer may already be waiting for you!

G+

Our G+ community is an ever-growing mix of makers, educators, industry professionals, and those completely new to Pi and eager to learn more about the Foundation and the product. Here you’ll find project shares, tech questions, and conversation. It’s worth stopping by if you use the platform.

Code Club and CoderDojo

You should also check out the social media accounts of our BFFs Code Club and CoderDojo!


On the CoderDojo website, along with their active forum, you’ll find links to all their accounts at the bottom of the page. For UK-focused Code Club information, head to the Code Club UK Twitter account, and for links to accounts of Code Clubs based in your country, use the search option on the Code Club International website.

Connect with us

However you want to connect with us, make sure to say hi. We love how active and welcoming our online community is and we always enjoy engaging in conversation, seeing your builds and events, and sharing Pi Towers mischief as well as useful Pi-related information and links with you!

If you use any other social platform and miss our presence there, let us know in the comments. And if you run your own Raspberry Pi-related forum, online group, or discussion board, share that as well!


Analysis of Top-N DynamoDB Objects using Amazon Athena and Amazon QuickSight

Post Syndicated from Rendy Oka original https://aws.amazon.com/blogs/big-data/analysis-of-top-n-dynamodb-objects-using-amazon-athena-and-amazon-quicksight/

If you run an operation that continuously generates a large amount of data, you may want to know what kind of data is being inserted by your application. The ability to analyze data intake quickly can be very valuable for business units, such as operations and marketing. For many operations, it’s important to see what is driving the business at any particular moment. For retail companies, for example, understanding which products are currently popular can aid in planning for future growth. Similarly, for PR companies, understanding the impact of an advertising campaign can help them market their products more effectively.

This post covers an architecture that helps you analyze your streaming data. You’ll build a solution using Amazon DynamoDB Streams, AWS Lambda, Amazon Kinesis Firehose, and Amazon Athena to analyze data intake at a frequency that you choose. And because this is a serverless architecture, you can use all of the services here without the need to provision or manage servers.

The data source

You’ll collect a random sampling of tweets via Twitter’s API and store a variety of attributes in your DynamoDB table, such as Twitter handle, tweet ID, hashtags, location, and a Time-To-Live (TTL) value.

In DynamoDB, the primary key is used as an input to an internal hash function. The output from this function determines the partition in which the data will be stored. When using a combination of primary key and sort key as a DynamoDB schema, you need to make sure that no single partition key contains many more objects than the other partition keys because this can cause partition level throttling. For the demonstration in this blog, the Twitter handle will be the primary key and the tweet ID will be the sort key. This allows you to group and sort tweets from each user.
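
To make the composite key concrete, here is a small boto3 sketch of how a single tweet item could be written. The table name, the attribute names beyond the two keys, and all the values are assumptions for illustration only:

import boto3

dynamodb = boto3.client('dynamodb')

# user_id is the partition (HASH) key, tweet_id the sort (RANGE) key;
# additional attributes such as hashtags and the TTL value ride along.
dynamodb.put_item(
    TableName='TwitterFeed',
    Item={
        'user_id': {'S': 'example_handle'},
        'tweet_id': {'N': '901234567890123456'},
        'hashtags': {'SS': ['MayweatherMcGregor']},
        'ttl': {'N': '1504000000'}
    }
)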

To help you get started, I have written a script that pulls a live Twitter stream that you can use to generate your data. All you need to do is provide your own Twitter Apps credentials, and it should generate the data immediately. Alternatively, I have also provided a script that you can use to generate random Tweets with little effort.

You can find both scripts in the Github repository:

https://github.com/awslabs/aws-blog-dynamodb-analysis

There are some modules that you may need to install to run these scripts; they are available from Python’s package repository (PyPI).

To get your own Twitter credentials, go to https://www.twitter.com/ and sign up for a free account, if you don’t already have one. After your account is set up, go to https://apps.twitter.com/. On the main landing page, choose the Create New App button. After the application is created, go to Keys and Access Tokens to get your credentials to use the Twitter API. You’ll need to generate a Consumer Key/Secret pair and an Access Token/Secret pair. All four values will be used to authenticate your requests.

Architecture overview

Before we begin, let’s take a look at what the overall flow of information will look like, from data ingestion into DynamoDB to visualization of results in Amazon QuickSight.

As illustrated in the architecture diagram above, any changes made to the items in DynamoDB will be captured and processed using DynamoDB Streams. Next, a Lambda function will be invoked by a trigger that is configured to respond to events in DynamoDB Streams. The Lambda function processes the data prior to pushing to Amazon Kinesis Firehose, which will output to Amazon S3. Finally, you use Amazon Athena to analyze the streaming data landing in Amazon S3. The result can be explored and visualized in Amazon QuickSight for your company’s business analytics.

You’ll need to implement your custom Lambda function to help transform the raw <key, value> data stored in DynamoDB to a JSON format for Athena to digest, but I’ve provided sample code that you are free to modify.

Implementation

In the following sections, I’ll walk through how you can set up the architecture discussed earlier.

Create your DynamoDB table

First, let’s create a DynamoDB table and enable DynamoDB Streams. This will enable data to be copied out of this table. From the console, use the user_id as the partition key and tweet_id as the sort key:
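
If you would rather script this step than click through the console, a minimal boto3 sketch looks like the following. The table name and the throughput values are assumptions; adjust them to your own setup:

import boto3

dynamodb = boto3.client('dynamodb')

# user_id is the partition (HASH) key and tweet_id is the sort (RANGE) key,
# matching the schema described above.
dynamodb.create_table(
    TableName='TwitterFeed',
    KeySchema=[
        {'AttributeName': 'user_id', 'KeyType': 'HASH'},
        {'AttributeName': 'tweet_id', 'KeyType': 'RANGE'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'user_id', 'AttributeType': 'S'},
        {'AttributeName': 'tweet_id', 'AttributeType': 'N'}
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
)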

After the table is ready, you can enable DynamoDB Streams. This process operates asynchronously, so there is no performance impact on the table when you enable this feature. The easiest way to manage DynamoDB Streams is also through the DynamoDB console.

In the Overview tab of your newly created table, click Manage Stream. In the window, choose the information that will be written to the stream whenever data in the table is added or modified. In this example, you can choose either New image or New and old images.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
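
The stream can also be enabled from code rather than the console. Here is a hedged boto3 sketch, assuming the table name used above and the New and old images view type:

import boto3

dynamodb = boto3.client('dynamodb')

# Enable the stream and capture both the new and the old item images.
dynamodb.update_table(
    TableName='TwitterFeed',
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
    }
)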

Configure Kinesis Firehose

Before creating the Lambda function, you need to configure a Kinesis Firehose delivery stream so that it’s ready to accept data from Lambda. Open the Firehose console and choose Create Firehose Delivery Stream. From here, choose S3 as the destination and use the following information to configure the resource. Note the delivery stream name because you will use it in the next step.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/firehose/latest/dev/basic-create.html#console-to-s3
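
For those who prefer to script the setup, the same delivery stream can be created through the Firehose API. The stream name, IAM role ARN, and bucket ARN below are placeholders; the S3 prefix simply mirrors the location used later in the Athena table definition:

import boto3

firehose = boto3.client('firehose')

# The role must allow Firehose to write objects into the destination bucket.
firehose.create_delivery_stream(
    DeliveryStreamName='ddb-tweet-stream',
    S3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose_delivery_role',
        'BucketARN': 'arn:aws:s3:::myBucket',
        'Prefix': 'dynamodb/streams/transactions/'
    }
)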

Create your Lambda function

Now that Kinesis Firehose is ready to accept data, you can create your Lambda function.

From the AWS Lambda console, choose the Create a Lambda function button and use the Blank Function. Enter a name and description, and choose Python 2.7 as the Runtime. Note your Lambda function name because you’ll need it in the next step.

In the Lambda function code field, you can paste the script that I have written for this purpose. All this function needs is the name of your Firehose delivery stream, set as an environment variable.

import boto3
import os

# Initiate the Firehose client outside the handler so it is reused
# across invocations of the same Lambda container.
firehose_client = boto3.client('firehose')

def lambda_handler(event, context):
    records = []
    try:
        for record in event['Records']:
            # Build one newline-delimited JSON record per DynamoDB Streams entry.
            t_stats = '{ "table_name":"%s", "user_id":"%s", "tweet_id":"%s", "approx_post_time":"%d" }\n' \
                      % ( record['eventSourceARN'].split('/')[1], \
                          record['dynamodb']['Keys']['user_id']['S'], \
                          record['dynamodb']['Keys']['tweet_id']['N'], \
                          int(record['dynamodb']['ApproximateCreationDateTime']) )
            records.append({"Data": t_stats})
        # Send the whole batch to the delivery stream named in the
        # firehose_stream_name environment variable.
        firehose_client.put_record_batch(
            DeliveryStreamName = os.environ['firehose_stream_name'],
            Records = records
        )
        return 'Successfully processed {} records.'.format(len(event['Records']))
    except Exception as e:
        # Log the failure rather than silently swallowing it.
        print('Error processing records: {}'.format(e))

The handler should be set to lambda_function.lambda_handler and you can use the existing lambda_dynamodb_streams role that’s been created by default.
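
If you would rather set the environment variable from code than in the console, a short boto3 sketch follows. The function name and delivery stream name are assumptions; use whatever names you chose earlier:

import boto3

lambda_client = boto3.client('lambda')

# Point the function at the Firehose delivery stream created earlier.
lambda_client.update_function_configuration(
    FunctionName='ddb-stream-to-firehose',
    Environment={'Variables': {'firehose_stream_name': 'ddb-tweet-stream'}}
)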

Enable DynamoDB trigger and start collecting data

Everything is ready to go. Open your table using the DynamoDB console and go to the Triggers tab. Select the Create trigger drop down list and choose Existing Lambda function. In the pop-up window, select the function that you just created, and choose the Create button.
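
The same trigger can be wired up programmatically instead of through the console. A minimal sketch, assuming the table and function names used in the earlier examples:

import boto3

dynamodb = boto3.client('dynamodb')
lambda_client = boto3.client('lambda')

# Look up the ARN of the table's stream, then map it to the Lambda function.
stream_arn = dynamodb.describe_table(TableName='TwitterFeed')['Table']['LatestStreamArn']

lambda_client.create_event_source_mapping(
    EventSourceArn=stream_arn,
    FunctionName='ddb-stream-to-firehose',
    StartingPosition='LATEST',
    BatchSize=100
)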

At this point, you can start collecting data with the Python scripts that I’ve provided. The first script pulls a live public Twitter stream and the other generates fake tweets using Lorem Ipsum text.

Configure Amazon Athena to read the data

Next, you will configure Amazon Athena so that it can read the data Kinesis Firehose outputs to Amazon S3 and allow you to analyze the data as needed. You can connect to Athena directly from the Athena console, and you can establish a connection using JDBC or the Athena API. In this example, I’m going to demonstrate what this looks like on the Athena console.

First, create a new database and a new table. You can do this by running the following two queries. The first query creates a new database:

CREATE DATABASE IF NOT EXISTS ddbtablestats

And the second query creates a new table:

CREATE EXTERNAL TABLE IF NOT EXISTS ddbtablestats.twitterfeed (
    `table_name` string,
    `user_id` string,
    `tweet_id` bigint,
    `approx_post_time` timestamp 
) PARTITIONED BY (
    year string,
    month string,
    day string,
    hour string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://myBucket/dynamodb/streams/transactions/'

Note that this table is created using partitions. Partitioning separates your data into logical parts based on certain criteria, such as date, location, language, etc. This allows Athena to selectively pull your data without needing to process the entire data set. This effectively minimizes the query execution time, and it also allows you to have greater control over the data that you want to query.

After the query has completed, you should be able to see the table in the left side pane of the Athena dashboard.

After the database and table have been created, execute the ALTER TABLE query to populate the partitions in your table. Replace the date with the current date when the script was executed.

ALTER TABLE ddbtablestats.TwitterFeed ADD IF NOT EXISTS
PARTITION (year='2017',month='05',day='17',hour='01') location 's3://myBucket/dynamodb/streams/transactions/2017/05/17/01/'

Using the Athena console, you’ll need to manually populate each additional partition that you’d like to analyze; however, you can automate this process programmatically by using the JDBC driver or any AWS SDK of your choice.
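
For example, with boto3 you can submit the same ALTER TABLE statement through the Athena API on a schedule. This is a sketch only; the output location is an assumption, since Athena needs somewhere to write its query results even for DDL statements:

import boto3
from datetime import datetime

athena = boto3.client('athena')

now = datetime.utcnow()
query = (
    "ALTER TABLE ddbtablestats.twitterfeed ADD IF NOT EXISTS "
    "PARTITION (year='{0:%Y}', month='{0:%m}', day='{0:%d}', hour='{0:%H}') "
    "location 's3://myBucket/dynamodb/streams/transactions/{0:%Y}/{0:%m}/{0:%d}/{0:%H}/'".format(now)
)

# Submit the DDL statement; Athena runs it asynchronously.
athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://myBucket/athena-results/'}
)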

For more information on partitioning in Athena, check out our documentation:

http://docs.aws.amazon.com/athena/latest/ug/partitions.html

Querying the data in Amazon Athena

This is it! Let’s run this query to see the top 10 most active Twitter users in the last 24 hours. You can do this from the Athena console:

SELECT user_id, COUNT(DISTINCT tweet_id) tweets FROM ddbTableStats.TwitterFeed
WHERE year='2017' AND month='05' AND day='17'
GROUP BY user_id
ORDER BY tweets DESC
LIMIT 10

The result should look similar to the following:

Linking Athena to Amazon QuickSight

Finally, to make this data available to a larger audience, let’s visualize this data in Amazon QuickSight. Amazon QuickSight provides native connectivity to AWS data sources such as Amazon Redshift, Amazon RDS, and Amazon Athena. Amazon QuickSight can also connect to on-premises databases, Excel, or CSV files, and it can connect to cloud data sources such as Salesforce.com. For this solution, we will connect Amazon QuickSight to the Athena table we just created.

Amazon QuickSight has a free tier that provides 1 user and 1GB of SPICE (Superfast Parallel In-memory Calculated Engine) capacity free. So you can sign up and use QuickSight free of charge.

When you are signing up for Amazon QuickSight, ensure that you grant permissions for QuickSight to connect to Athena and the S3 bucket where the data is stored.

After you’ve signed up, navigate to the new analysis button, and choose new data set, and then select the Athena data source option. Create a new name for your data source and proceed to the next prompt. At this point, you should see the Athena table you created earlier.

Choose the option to import the data to SPICE for a quicker analysis. SPICE is an in-memory optimized calculation engine that is designed for quick data visualization through parallel processing. SPICE also enables you to refresh your data sets at a regular interval or on-demand as you want.

In the dialog box, confirm this data set creation, and you’ll arrive on the landing page where you can start building your graph. The X-axis will represent the user_id and the Value will be used to represent the SUM total of the tweets from each user.

The Amazon QuickSight report looks like this:

Through this visualization, I can easily see that there are 3 users that tweeted over 20 times that day and that the majority of the users have fewer than 10 tweets that day. I can also set up a scheduled refresh of my SPICE dataset so that I have a dashboard that is regularly updated with the latest data.

Closing thoughts

Here are the benefits that you can gain from using this architecture:

  1. You can optimize the design of your DynamoDB schema to follow AWS best practice recommendations.
  2. You can run analysis and data intelligence in order to understand current customer demand for your business.
  3. You can store incremental backups for future auditing.

The flexibility of our AWS services invites you to create and design the ideal workflow for your production at any scale, and, as always, if you ever need some guidance, don’t hesitate to reach out to us. I hope this has been helpful to you! Please leave any questions and comments below.


Additional Reading

Learn how to analyze VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight.


About the Author

Rendy Oka is a Big Data Support Engineer for Amazon Web Services. He provides consultations and architectural designs and partners with the TAMs, Solution Architects, and AWS product teams to help develop solutions for our customers. He is also a team lead for the big data support team in Seattle. Rendy has traveled to dozens of countries around the world and takes every opportunity to experience the local culture wherever he goes.

NonPetya: no evidence it was a "smokescreen"

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/06/nonpetya-no-evidence-it-was-smokescreen.html

Many well-regarded experts claim that the not-Petya ransomware wasn’t “ransomware” at all, but a “wiper” whose goal was to destroy files, without any intent at letting victims recover their files. I want to point out that there is no real evidence of this.

Certainly, things look suspicious. For one thing, it clearly targeted the Ukraine. For another, it made several mistakes that prevent the perpetrators from ever decrypting drives: their email account was shut down, and the malware corrupts the boot sector.

But these things aren’t evidence, they are problems. They are things needing explanation, not things that support our preferred conspiracy theory.

The simplest, Occam’s Razor explanation is that they were simple mistakes. Such mistakes are common among ransomware. We think of virus writers as professional software developers who thoroughly test their code. Decades of evidence show the opposite, that such software is of poor quality with shockingly bad bugs.

It’s true that effectively, nPetya is a wiper. Matthieu Suiche does a great job describing one flaw that prevents it from working. @hasherezade does a great job explaining another flaw. But the best explanation isn’t that this is intentional. Even if these bugs didn’t exist, it’d still be a wiper if the perpetrators simply ignored the decryption requests. They need not intentionally make the decryption fail.

Thus, the simpler explanation is that it’s simply a bug. Ransomware authors test the bits they care about, and test less well the bits they don’t. It’s quite plausible to believe that just before shipping the code, they’d add a few extra features, and forget to regression test the entire suite. I mean, I do that all the time with my code.

Some have pointed to the sophistication of the code as proof that such simple errors are unlikely. This isn’t true. While it’s more sophisticated than WannaCry, it’s about average for the current state-of-the-art for ransomware in general. What people think of, such as the Petya base, or using PsExec to spread throughout a Windows domain, is already at least a year old.

Indeed, the use of PsExec itself is a bit clumsy, when the code for doing the same thing is already public. It’s just a few calls to basic Windows networking APIs. A sophisticated virus would do this itself, rather than clumsily use PsExec.

Infamy doesn’t mean skill. People keep making the mistake that the more widespread something is in the news, the more skill, the more of a “conspiracy” there must be behind it. This is not true. Virus/worm writers often do newsworthy things by accident. Indeed, the history of worms, starting with the Morris Worm, has been one of things running out of control, beyond their authors’ expectations.

What makes nPetya newsworthy isn’t the EternalBlue exploit or the wiper feature. Instead, the creators got lucky with MeDoc. The software is used by every major organization in the Ukraine, and at the same time, their website was horribly insecure — laughably insecure. Furthermore, its auto-update feature didn’t check cryptographic signatures. No hacker can plan for this level of widespread incompetence — it’s just extreme luck.

Thus, the effect of bumbling around is something that hit the Ukraine pretty hard, but it’s not necessarily the intent of the creators. It’s like how the Slammer worm hit South Korea pretty hard, or how the Witty worm hit the DoD pretty hard. These things look “targeted”, especially to the victims, but it was by pure chance (provably so, in the case of Witty).

Certainly, MeDoc was targeted. But then, targeting a single organization is the norm for ransomware. They have to do it that way, giving each target a different Bitcoin address for payment. That it then spread to the entire Ukraine, and further, is the sort of thing that typically surprises worm writers.

Finally, there’s little reason to believe that there needs to be a “smokescreen”. Russian hackers are targeting the Ukraine all the time. Whether Russian hackers are to blame for “ransomware” vs. “wiper” makes little difference.

Conclusion

We know that Russian hackers are constantly targeting the Ukraine. Therefore, the theory that this was nPetya’s goal all along, to destroy Ukraine’s computers, is a good one.

Yet, there’s no actual “evidence” of this. nPetya’s issues are just as easily explained by normal software bugs. The smokescreen isn’t needed. The boot record bug isn’t needed. The single email address that was shut down isn’t significant, since half of all ransomware uses the same technique.

The experts who disagree with me are really smart/experienced people who you should generally trust. It’s just that I can’t see their evidence.

Update: I wrote another blogpost about “survivorship bias”, refuting the claim by many experts talking about the sophistication of the spreading feature.


Update: a comment asks “why is there no Internet spreading code?”. The answer is “I don’t know”, but unanswerable questions aren’t evidence of a conspiracy. “Why aren’t there any stars in the background?” isn’t proof the moon landings are fake just because you can’t answer the question. One guess is that you never want ransomware to spread that far, until you’ve figured out how to get payment from so many people.

mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html

Introducing mkosi

After blogging about casync I realized I never blogged about the
mkosi tool that combines nicely with it. mkosi has been around for a
while already, and it’s time to make it a bit better known. mkosi
stands for Make Operating System Image, and is a tool for precisely
that: generating an OS tree or image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite
well known and popular. But mkosi has a number of features that I
think make it interesting for a variety of use-cases that other tools
don’t cover that well.

What is mkosi?

What are those use-cases, and what precisely sets mkosi apart?
mkosi is definitely a tool with a focus on developers’ needs for
building OS images, for testing and debugging, but also for generating
production images with cryptographic protection. A typical use-case
would be to add a mkosi.default file to an existing project (for
example, one written in C or Python), and thus making it easy to
generate an OS image for it. mkosi will put together the image with
development headers and tools, compile your code in it, run your test
suite, then throw away the image again, and build a new one, this time
without development headers and tools, and install your build
artifacts in it. This final image is then “production-ready”, and only
contains your built program and the minimal set of packages you
configured otherwise. Such an image could then be deployed with
casync (or any other tool of course) to be delivered to your set of
servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on
today’s technology, not yesteryear’s. Specifically this means that
we’ll generate GPT partition tables, not MBR/DOS ones. When you tell
mkosi to generate a bootable image for you, it will make it bootable
on EFI, not on legacy BIOS. The GPT images generated follow
specifications such as the Discoverable Partitions Specification,
so that /etc/fstab can remain unpopulated and tools such as
systemd-nspawn can automatically dissect the image and boot from
them.

So, let’s have a look at the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional
options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build
images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same
feature level currently. Also, as mkosi is based on dnf
--installroot, debootstrap, pacstrap and zypper, and those
packages are not packaged universally on all distributions, you might
not be able to build images for all those distributions on arbitrary
host distributions. For example, Fedora doesn’t package zypper,
hence you cannot build an openSUSE image easily on Fedora, but you can
still build Fedora (obviously…), Debian, Ubuntu and ArchLinux images
on it just fine.

The GPT images are put together in a way that they aren’t just
compatible with UEFI systems, but also with VM and container managers
(that is, at least the smart ones, i.e. VM managers that know UEFI,
and container managers that grok GPT disk images) to a large
degree. In fact, the idea is that you can use mkosi to build a
single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off it, using systemd’s RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used
if available to ensure the image is not tampered with (yes, you read
that right, systemd-nspawn and systemd’s RootImage= setting
automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest usage of mkosi is by simply invoking it without
parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you,
will call it image.raw and drop it in the current directory. The
distribution used will be the same one as your host runs.

Of course in most cases you want more control over how the image is
put together, i.e. select package sets, select the distribution, size
partitions and so on. Most of that you can actually specify on the
command line, but it is recommended to instead create a couple of
mkosi.$SOMETHING files and directories in some directory. Then,
simply change to that directory and run mkosi without any further
arguments. The tool will then look in the current working directory
for these files and directories and make use of them (similar to how
make looks for a Makefile…). Every single file/directory is
optional, but if they exist they are honored. Here’s a list of the
files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you
    can configure what kind of image you want, which distribution, which
    packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy
    everything inside it into the images built. You can place arbitrary
    directory hierarchies in here, and they’ll be copied over whatever is
    already in the image, after it was put together by the distribution’s
    package manager. This is the best way to drop additional static files
    into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build
    script. When it exists, mkosi will build two images, one after the
    other in the mode already mentioned above: the first version is the
    build image, and may include various build-time dependencies such as
    a compiler or development headers. The build script is also copied
    into it, and then run inside it. The script should then build
    whatever shall be built and place the result in $DESTDIR (don’t
    worry, popular build tools such as Automake or Meson all honor
    $DESTDIR anyway, so there’s not much to do here explicitly). It may
    also run a test suite, or anything else you like. After the script
    finished, the build image is removed again, and a second image (the
    final image) is built. This time, no development packages are
    included, and the build script is not copied into the image again —
    however, the build artifacts from the first run (i.e. those placed in
    $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked
    inside the image (inside a systemd-nspawn invocation) and can
    adjust the image as it likes at a very late point in the image
    preparation. If mkosi.build exists, i.e. the dual-phased
    development build process is used, then this script will be invoked
    twice: once inside the build image and once inside the final
    image. The first parameter passed to the script clarifies which phase
    it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a
    container configuration file for systemd-nspawn (see
    systemd.nspawn(5)
    for details), which shall be shipped along with the final image and
    shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package
    cache directory for the builds. This directory is effectively bind
    mounted into the image at build time, in order to speed up building
    images. The package installers of the various distributions will
    place their package files here, so that subsequent runs can reuse
    them.

  7. mkosi.passphrase — If this file exists, it should contain a
    pass-phrase to use for the LUKS encryption (if that’s enabled for the
    image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an
    X.509 key pair to use for signing the kernel and initrd for UEFI
    SecureBoot, if that’s enabled.

How to use it

So, let’s come back to our most trivial example, without any of the
mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create a build file image.raw in the current
directory. How do we use it? Of course, we could dd it onto some USB
stick and boot it on a bare-metal device. However, it’s much simpler
to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let’s make things more interesting. Let’s still not use any of
the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar to the above, but we made three changes: it’s no
longer GPT + ext4, but GPT + btrfs. Moreover, the system is made
bootable on UEFI systems, and finally, the output is now called
foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except
that this uses full VM virtualization rather than container
virtualization. (Note that the way to run a UEFI qemu/kvm instance
appears to change all the time and is different on the various
distributions. It’s quite annoying, and I can’t really tell you what
the right qemu command line is to make this work on your system.)

Of course, it’s not all raw GPT disk images with mkosi. Let’s try
a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as plain directory you can’t boot
it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask
mkosi to generate a compressed GPT image with a root squashfs,
compress the result with xz, and generate a SHA256SUMS file with
the hashes of the generated artifacts. The image will contain the
SSH client as well as everybody’s favorite editor.

Now, let’s make use of the various mkosi.$SOMETHING files. Let’s
say we are working on some Automake-based project and want to make it
easy to generate a disk image off the development tree with the
version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let’s add a build script:

# cat > mkosi.build <<EOF
#!/bin/sh
cd $SRCDIR
./autogen.sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let’s try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you’ll notice that building an image like
this can be quite slow. And slow build times are actively hurtful to
your productivity as a developer. Hence let’s make things a bit
faster. First, let’s make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and
generate less network traffic) as the packages will now be downloaded
only once and reused. However, you’ll notice that unpacking all those
packages and the rest of the work is still quite slow. But mkosi can
help you with that. Simply use mkosi‘s incremental build feature. In
this mode mkosi will make a copy of the build and final images
immediately before dropping in your build sources or artifacts, so
that building an image becomes a lot quicker: instead of always
starting totally from scratch a build will now reuse everything it can
reuse from a previous run, and immediately begin with building your
sources rather than the build image to build your sources in. To
enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated
anymore from your distribution’s servers, as the cached copy is made
after all packages are installed, and hence until you actually delete
the cached copy the distribution’s network servers aren’t contacted
again and no RPMs or DEBs are downloaded. This means the distribution
you use becomes “frozen in time” this way. (Which might be a bad
thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you’ll notice that it
won’t overwrite the generated image when it already exists. You can
either delete the file yourself first (rm image.raw) or let mkosi
do it for you right before building a new image, with mkosi -f. You
can also tell mkosi to not only remove any such pre-existing images,
but also remove any cached copies of the incremental feature, by using
-f twice.

I wrote mkosi originally in order to test systemd, and quickly
generate a disk image of various distributions with the most current
systemd version from git, without all that affecting my host system. I
regularly use mkosi for that today, in incremental mode. The two
commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the
very newest set of RPMs provided by Fedora, instead of a cached
snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git
tree:
mkosi.default
and
mkosi.build. This
way, any developer who wants to quickly test something with current
systemd git, or wants to prepare a patch based on it and test it can
check out the systemd repository and simply run mkosi in it and a
few minutes later he has a bootable image he can test in
systemd-nspawn or KVM. casync has similar files:
mkosi.default,
mkosi.build.

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled
    disk images if you ask for it. For that use the --verity switch on
    the command line or Verity= setting in mkosi.default. Of course,
    dm-verity implies that the root volume is read-only. In this mode
    the top-level dm-verity hash will be placed along-side the output
    disk image in a file named the same way, but with the .roothash
    suffix. If the image is to be created bootable, the root hash is also
    included on the kernel command line in the roothash= parameter,
    which current systemd versions can use to both find and activate the
    root partition in a dm-verity protected way. BTW: it’s a good idea
    to combine this dm-verity mode with the raw_squashfs image mode,
    to generate a genuinely protected, compressed image suitable for
    running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum
    file SHA256SUMS for you (--checksum) covering all the files it
    outputs (which could be the image file itself, a matching .nspawn
    file using the mkosi.nspawn file mentioned above, as well as the
    .roothash file for the dm-verity root hash.) It can then
    optionally sign this with gpg (--sign). Note that systemd‘s
    machinectl pull-tar and machinectl pull-raw commands can download
    these files and the SHA256SUMS file automatically and verify things
    on download. In other words: what mkosi outputs is perfectly
    ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To
    make use of that, place your X.509 key pair in two files
    mkosi.secureboot.crt and mkosi.secureboot.key, and set
    SecureBoot= or --secure-boot. If so, mkosi will sign the
    kernel/initrd/kernel command line combination during the build. Of
    course, if you use this mode, you should also use
    Verity=/--verity=, otherwise the setup makes only partial
    sense. Note that mkosi will not help you with actually enrolling
    the keys you use in your UEFI BIOS.

  4. mkosi has minimal support for GIT checkouts: when it recognizes
    it is run in a git checkout and you use the mkosi.build script
    stuff, the source tree will be copied into the build image, but with
    all files excluded by .gitignore removed.

  5. There’s support for encryption in place. Use --encrypt= or
    Encrypt=. Note that the UEFI ESP is never encrypted though, and the
    root partition only if explicitly requested. The /home and /srv
    partitions are unconditionally encrypted if that’s enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line
    arguments may be configured for the image to generate.

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies,
listed in the
README. Most
notably you need a somewhat recent systemd version to make use of its
full feature set: systemd 233. Older versions are already packaged for
various distributions, but much of what I describe above is only
available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn’t
available in Fedora, but there’s a COPR.

Future

It is my intention to continue turning mkosi into a tool suitable
for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and
systemd/sd-boot native support for A/B IoT style partition
setups. The idea is that the combination of systemd, casync and
mkosi provides generic building blocks for building secure,
auto-updating devices in a generic way, even though all pieces
may be used individually, too.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like
    $SOMEOTHERPROJECT!
    — Well, to my knowledge there’s no tool that
    integrates this nicely with your project’s development tree, and can
    do dm-verity and UEFI SecureBoot and all that stuff for you. So
    nope, I don’t think this is exactly like $SOMEOTHERPROJECT, thank you
    very much.

  2. What about creating MBR/DOS partition images? — That’s really
    out of focus to me. This is an exercise in figuring out how generic
    OSes and devices in the future should be built and an attempt to
    commoditize OS image building. And no, the future doesn’t speak MBR,
    sorry. That said, I’d be quite interested in adding support for
    booting on Raspberry Pi, possibly using a hybrid approach, i.e. using
    a GPT disk label, but arranging things in a way that the Raspberry Pi
    boot protocol (which is built around DOS partition tables), can still
    work.

  3. Is this portable? — Well, depends what you mean by
    portable. No, this tool runs on Linux only, and as it uses
    systemd-nspawn during the build process it doesn’t run on
    non-systemd systems either. But then again, you should be able to
    create images for any architecture you like with it, but of course if
    you want the image bootable on bare-metal systems only systems doing
    UEFI are supported (but systemd-nspawn should still work fine on
    them).

  4. Where can I get this stuff? — Try
    GitHub. And some distributions
    carry packaged versions, but I think none of them the current v3
    yet.

  5. Is this a systemd project? — Yes, it’s hosted under the
    systemd GitHub umbrella. And yes,
    during run-time systemd-nspawn in a current version is required. But
    no, the code-bases are separate otherwise, already because systemd
    is a C project, and mkosi Python.

  6. Requiring systemd 233 is a pretty steep requirement, no?
    Yes, but the feature we need kind of matters (systemd-nspawn‘s
    --overlay= switch), and again, this isn’t supposed to be a tool for
    legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am
    not an LXC nor Docker guy. If you select directory or subvolume
    as image type, LXC should be able to boot the generated images just
    fine, but I didn’t try. Last time I looked, Docker doesn’t permit
    running proper init systems as PID 1 inside the container, as they
    define their own run-time without intention to emulate a proper
    system. Hence, no I don’t think it will work, at least not with an
    unpatched Docker version. That said, again, don’t ask me questions
    about Docker, it’s not precisely my area of expertise, and quite
    frankly I am not a fan. To my knowledge neither LXC nor Docker are
    able to run containers directly off GPT disk images, hence the
    various raw_xyz image types are definitely not compatible with
    either. That means if you want to generate a single raw disk image
    that can be booted unmodified both in a container and on bare-metal,
    then systemd-nspawn is the container manager to go for
    (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that’s up to you really.

If you hack on some complex project and need a quick way to compile
and run your project on a specific current Linux distribution, then
mkosi is an excellent way to do that. Simply drop the mkosi.default
and mkosi.build files in your git tree and everything will be
easy. (And of course, as indicated above: if the project you are
hacking on happens to be called systemd or casync be aware that
those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great
choice too, as it will make it reasonably easy to generate secure
images that are protected against offline modification, by using
dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a
VM or systemd-nspawn container, or a portable service then mkosi
is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd
init systems, old VM managers, Docker, … then no, mkosi is not for
you, but there are plenty of well-established alternatives around that
cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to
accept your patches and other contributions.

Oh, and one unrelated last thing: don’t forget to submit your talk
proposal and/or buy a ticket for All Systems Go! 2017 in Berlin — the
conference where things like systemd, casync and mkosi are
discussed, along with a variety of other Linux userspace projects used
for building systems.

mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html

Introducing mkosi

After blogging about
casync
I realized I never blogged about the
mkosi tool that combines nicely
with it. mkosi has been around for a while already, and its time to
make it a bit better known. mkosi stands for Make Operating System
Image
, and is a tool for precisely that: generating an OS tree or
image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite
well known and popular. But mkosi has a number of features that I
think make it interesting for a variety of use-cases that other tools
don’t cover that well.

What is mkosi?

What are those use-cases, and what does mkosi precisely set apart?
mkosi is definitely a tool with a focus on developer’s needs for
building OS images, for testing and debugging, but also for generating
production images with cryptographic protection. A typical use-case
would be to add a mkosi.default file to an existing project (for
example, one written in C or Python), and thus making it easy to
generate an OS image for it. mkosi will put together the image with
development headers and tools, compile your code in it, run your test
suite, then throw away the image again, and build a new one, this time
without development headers and tools, and install your build
artifacts in it. This final image is then “production-ready”, and only
contains your built program and the minimal set of packages you
configured otherwise. Such an image could then be deployed with
casync (or any other tool of course) to be delivered to your set of
servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on
today’s technology, not yesteryear’s. Specifically this means that
we’ll generate GPT partition tables, not MBR/DOS ones. When you tell
mkosi to generate a bootable image for you, it will make it bootable
on EFI, not on legacy BIOS. The GPT images generated follow
specifications such as the Discoverable Partitions
Specification
,
so that /etc/fstab can remain unpopulated and tools such as
systemd-nspawn can automatically dissect the image and boot from
them.

So, let’s have a look on the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional
options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build
images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same
feature level currently. Also, as mkosi is based on dnf
--installroot
, debootstrap, pacstrap and zypper, and those
packages are not packaged universally on all distributions, you might
not be able to build images for all those distributions on arbitrary
host distributions.

The GPT images are put together in a way that they aren’t just
compatible with UEFI systems, but also with VM and container managers
(that is, at least the smart ones, i.e. VM managers that know UEFI,
and container managers that grok GPT disk images) to a large
degree. In fact, the idea is that you can use mkosi to build a
single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off, using systemd’s RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used
if available to ensure the image is not tampered with (yes, you read
that right, systemd-nspawn and systemd’s RootImage= setting
automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest usage of mkosi is by simply invoking it without
parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you,
will call it image.raw and drop it in the current directory. The
distribution used will be the same one as your host runs.

Of course in most cases you want more control about how the image is
put together, i.e. select package sets, select the distribution, size
partitions and so on. Most of that you can actually specify on the
command line, but it is recommended to instead create a couple of
mkosi.$SOMETHING files and directories in some directory. Then,
simply change to that directory and run mkosi without any further
arguments. The tool will then look in the current working directory
for these files and directories and make use of them (similar to how
make looks for a Makefile…). Every single file/directory is
optional, but if they exist they are honored. Here’s a list of the
files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you
    can configure what kind of image you want, which distribution, which
    packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy
    everything inside it into the images built. You can place arbitrary
    directory hierarchies in here, and they’ll be copied over whatever is
    already in the image, after it was put together by the distribution’s
    package manager. This is the best way to drop additional static files
    into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build
    script. When it exists, mkosi will build two images, one after the
    other in the mode already mentioned above: the first version is the
    build image, and may include various build-time dependencies such as
    a compiler or development headers. The build script is also copied
    into it, and then run inside it. The script should then build
    whatever shall be built and place the result in $DESTDIR (don’t
    worry, popular build tools such as Automake or Meson all honor
    $DESTDIR anyway, so there’s not much to do here explicitly). It may
    also run a test suite, or anything else you like. After the script
    finished, the build image is removed again, and a second image (the
    final image) is built. This time, no development packages are
    included, and the build script is not copied into the image again —
    however, the build artifacts from the first run (i.e. those placed in
    $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked
    inside the image (inside a systemd-nspawn invocation) and can
    adjust the image as it likes at a very late point in the image
    preparation. If mkosi.build exists, i.e. the dual-phased
    development build process used, then this script will be invoked
    twice: once inside the build image and once inside the final
    image. The first parameter passed to the script clarifies which phase
    it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a
    container configuration file for systemd-nspawn (see
    systemd.nspawn(5)
    for details), which shall be shipped along with the final image and
    shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package
    cache directory for the builds. This directory is effectively bind
    mounted into the image at build time, in order to speed up building
    images. The package installers of the various distributions will
    place their package files here, so that subsequent runs can reuse
    them.

  7. mkosi.passphrase — If this file exists, it should contain a
    pass-phrase to use for the LUKS encryption (if that’s enabled for the
    image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an
    X.509 key pair to use for signing the kernel and initrd for UEFI
    SecureBoot, if that’s enabled.
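
As a purely hypothetical illustration (the file names are the ones
documented above; the directory name and contents are made up), a
project using several of these might look like this:

my-project/
├── mkosi.default
├── mkosi.build
├── mkosi.extra/
│   └── etc/motd
├── mkosi.postinst
├── mkosi.nspawn
└── mkosi.cache/

Changing into my-project/ and running mkosi without arguments would
then pick all of these up automatically.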

How to use it

So, let’s come back to our most trivial example, without any of the
mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create an image file image.raw in the current
directory. How do we use it? Of course, we could dd it onto some USB
stick and boot it on a bare-metal device. However, it’s much simpler
to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let’s make things more interesting. Let’s still not use any of
the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar to the above, but we made three changes: it’s no
longer GPT + ext4, but GPT + btrfs. Moreover, the system is made
bootable on UEFI systems, and finally, the output is now called
foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except
that this uses full VM virtualization rather than container
virtualization. (Note that the way to run a UEFI qemu/kvm instance
appears to change all the time and is different on the various
distributions. It’s quite annoying, and I can’t really tell you what
the right qemu command line is to make this work on your system.)

Of course, it’s not all raw GPT disk images with mkosi. Let’s try
a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as a plain directory you can’t boot
it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask
mkosi to generate a compressed GPT image with a root squashfs,
compress the result with xz, and generate a SHA256SUMS file with
the hashes of the generated artifacts. The image will contain the
SSH client as well as everybody’s favorite editor.
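
Assuming the generated artifacts and the SHA256SUMS file ended up in
the current directory, verifying them afterwards is a one-liner (a
minimal sketch using the standard coreutils tool):

# sha256sum -c SHA256SUMS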

Now, let’s make use of the various mkosi.$SOMETHING files. Let’s
say we are working on some Automake-based project and want to make it
easy to generate a disk image off the development tree with the
version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let’s add a build script:

# cat > mkosi.build <<'EOF'
#!/bin/sh
./autogen.sh
./configure --prefix=/usr
make -j $(nproc)
make install
EOF
# chmod +x mkosi.build
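
If the project uses Meson rather than Automake, a hypothetical
mkosi.build could look like this instead (Meson’s install step honors
the $DESTDIR environment variable that mkosi sets, just as Automake’s
does):

# cat > mkosi.build <<'EOF'
#!/bin/sh
meson setup build
ninja -C build
ninja -C build install
EOF
# chmod +x mkosi.build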

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let’s try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you’ll notice that building an image like
this can be quite slow. And slow build times are actively hurtful to
your productivity as a developer. Hence let’s make things a bit
faster. First, let’s make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and
generate less network traffic) as the packages will now be downloaded
only once and reused. However, you’ll notice that unpacking all those
packages and the rest of the work is still quite slow. But mkosi can
help you with that. Simply use mkosi‘s incremental build feature. In
this mode mkosi will make a copy of the build and final images
immediately before dropping in your build sources or artifacts, so
that building an image becomes a lot quicker: instead of always
starting totally from scratch a build will now reuse everything it can
reuse from a previous run, and immediately begin with building your
sources rather than the build image to build your sources in. To
enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated
anymore from your distribution’s servers, as the cached copy is made
after all packages are installed, and hence until you actually delete
the cached copy the distribution’s network servers aren’t contacted
again and no RPMs or DEBs are downloaded. This means the distribution
you use becomes “frozen in time” this way. (Which might be a bad
thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you’ll notice that it
won’t overwrite the generated image when it already exists. You can
either delete the file yourself first (rm image.raw) or let mkosi
do it for you right before building a new image, with mkosi -f. You
can also tell mkosi to not only remove any such pre-existing images,
but also remove any cached copies of the incremental feature, by using
-f twice.

I wrote mkosi originally in order to test systemd, and quickly
generate a disk image of various distributions with the most current
systemd version from git, without all that affecting my host system. I
regularly use mkosi for that today, in incremental mode. The two
commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the
very newest set of RPMs provided by Fedora, instead of a cached
snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git
tree: mkosi.default and mkosi.build. This way, any developer who wants
to quickly test something with current systemd git, or wants to
prepare a patch based on it and test it, can check out the systemd
repository, simply run mkosi in it, and have a bootable image to test
in systemd-nspawn or KVM a few minutes later. casync ships similar
files: mkosi.default and mkosi.build.
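
As a minimal sketch of that workflow (assuming the in-tree
mkosi.default keeps the default image.raw output name; adjust if it
doesn’t):

# git clone https://github.com/systemd/systemd.git
# cd systemd
# mkosi -i
# systemd-nspawn -bi image.raw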

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled
    disk images if you ask for it. For that use the --verity switch on
    the command line or Verity= setting in mkosi.default. Of course,
    dm-verity implies that the root volume is read-only. In this mode
    the top-level dm-verity hash will be placed alongside the output
    disk image in a file named the same way, but with the .roothash
    suffix. If the image is built bootable, the root hash is also
    included on the kernel command line in the roothash= parameter,
    which current systemd versions can use to both find and activate the
    root partition in a dm-verity protected way. BTW: it’s a good idea
    to combine this dm-verity mode with the raw_squashfs image mode,
    to generate a genuinely protected, compressed image suitable for
    running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum
    file SHA256SUMS for you (--checksum) covering all the files it
    outputs (which could be the image file itself, a matching .nspawn
    file using the mkosi.nspawn file mentioned above, as well as the
    .roothash file for the dm-verity root hash.) It can then
    optionally sign this with gpg (--sign). Note that systemd‘s
    machinectl pull-tar and machinectl pull-raw commands can download
    these files and the SHA256SUMS file automatically and verify things
    on download. In other words: what mkosi outputs is perfectly
    ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To
    make use of that, place your X.509 key pair in the two files
    mkosi.secure-boot.crt and mkosi.secure-boot.key, and set
    SecureBoot= or --secure-boot. If so, mkosi will sign the
    kernel/initrd/kernel command line combination during the build. Of
    course, if you use this mode, you should also use
    Verity=/--verity=, otherwise the setup makes only partial
    sense. Note that mkosi will not help you with actually enrolling
    the keys you use in your UEFI BIOS. (A combined sketch of this and
    the dm-verity/check-sum options follows after this list.)

  4. mkosi has minimal support for GIT checkouts: when it recognizes
    it is run in a git checkout and you use the mkosi.build script
    stuff, the source tree will be copied into the build image, but with
    all files excluded by .gitignore removed.

  5. There’s support for encryption in place. Use --encrypt= or
    Encrypt=. Note that the UEFI ESP is never encrypted though, and the
    root partition only if explicitly requested. The /home and /srv
    partitions are unconditionally encrypted if that’s enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line
    arguments may be configured for the image to generate.
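
As promised above, here is a hedged sketch combining the dm-verity,
check-sum/signing and SecureBoot features from this list. The openssl
call merely generates a throwaway self-signed test key pair, the
output name iot.raw is made up, --sign assumes a usable gpg key is
configured, and you still have to enroll the certificate in the device
firmware yourself:

# openssl req -new -x509 -newkey rsa:2048 -nodes -days 365 \
    -subj "/CN=mkosi SecureBoot test/" \
    -keyout mkosi.secure-boot.key -out mkosi.secure-boot.crt
# mkosi -t raw_squashfs --bootable --verity --checksum --sign --secure-boot --xz -o iot.raw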

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies,
listed in the
README. Most
notably you need a somewhat recent systemd version to make use of its
full feature set: systemd 233. Older mkosi versions are already packaged
for various distributions, but much of what I describe above is only
available in the most recent release, mkosi 3.

The UEFI SecureBoot support requires sbsign, which currently isn’t
available in Fedora, but there’s a COPR for it.

Future

It is my intention to continue turning mkosi into a tool suitable
for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and
systemd/sd-boot native support for A/B IoT style partition
setups. The idea is that the combination of systemd, casync and
mkosi provides generic building blocks for building secure,
auto-updating devices in a generic way, even though all the pieces
may be used individually, too.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like
    $SOMEOTHERPROJECT!
    — Well, to my knowledge there’s no tool that
    integrates this nicely with your project’s development tree, and can
    do dm-verity and UEFI SecureBoot and all that stuff for you. So
    nope, I don’t think this is exactly like $SOMEOTHERPROJECT, thank you
    very much.

  2. What about creating MBR/DOS partition images? — That’s really
    out of focus for me. This is an exercise in figuring out how generic
    OSes and devices in the future should be built and an attempt to
    commoditize OS image building. And no, the future doesn’t speak MBR,
    sorry. That said, I’d be quite interested in adding support for
    booting on Raspberry Pi, possibly using a hybrid approach, i.e. using
    a GPT disk label, but arranging things in a way that the Raspberry Pi
    boot protocol (which is built around DOS partition tables) can still
    work.

  3. Is this portable? — Well, depends what you mean by
    portable. No, this tool runs on Linux only, and as it uses
    systemd-nspawn during the build process it doesn’t run on
    non-systemd systems either. Then again, you should be able to
    create images for any architecture you like with it; of course, if
    you want the image bootable on bare-metal systems, only systems doing
    UEFI are supported (but systemd-nspawn should still work fine on
    them).

  4. Where can I get this stuff? — Try
    GitHub. And some distributions
    carry packaged versions, but I think none of them carry the current
    v3 yet.

  5. Is this a systemd project? — Yes, it’s hosted under the
    systemd GitHub umbrella. And yes,
    during run-time systemd-nspawn in a current version is required. But
    no, the code-bases are otherwise separate, not least because systemd
    is a C project and mkosi a Python one.

  6. Requiring systemd 233 is a pretty steep requirement, no? — Yes,
    but the feature we need kind of matters (systemd-nspawn‘s
    --overlay= switch), and again, this isn’t supposed to be a tool for
    legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am
    not an LXC nor Docker guy. If you select directory or subvolume
    as image type, LXC should be able to boot the generated images just
    fine, but I didn’t try. Last time I looked, Docker doesn’t permit
    running proper init systems as PID 1 inside the container, as they
    define their own run-time without intention to emulate a proper
    system. Hence, no I don’t think it will work, at least not with an
    unpatched Docker version. That said, again, don’t ask me questions
    about Docker, it’s not precisely my area of expertise, and quite
    frankly I am not a fan. To my knowledge neither LXC nor Docker are
    able to run containers directly off GPT disk images, hence the
    various raw_xyz image types are definitely not compatible with
    either. That means if you want to generate a single raw disk image
    that can be booted unmodified both in a container and on bare-metal,
    then systemd-nspawn is the container manager to go for
    (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that’s up to you really.

If you hack on some complex project and need a quick way to compile
and run your project on a specific current Linux distribution, then
mkosi is an excellent way to do that. Simply drop the mkosi.default
and mkosi.build files in your git tree and everything will be
easy. (And of course, as indicated above: if the project you are
hacking on happens to be called systemd or casync be aware that
those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great
choice too, as it will make it reasonably easy to generate secure
images that are protected against offline modification, by using
dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a
VM, a systemd-nspawn container, or a portable service, then mkosi
is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd
init systems, old VM managers, Docker, … then no, mkosi is not for
you, but there are plenty of well-established alternatives around that
cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to
accept your patches and other contributions.

Oh, and one unrelated last thing: don’t forget to submit your talk
proposal and/or buy a ticket for All Systems Go! 2017 in Berlin — the
conference where things like systemd, casync and mkosi are
discussed, along with a variety of other Linux userspace projects used
for building systems.

How to Create an AMI Builder with AWS CodeBuild and HashiCorp Packer – Part 2

Post Syndicated from Heitor Lessa original https://aws.amazon.com/blogs/devops/how-to-create-an-ami-builder-with-aws-codebuild-and-hashicorp-packer-part-2/

Written by AWS Solutions Architects Jason Barto and Heitor Lessa

 
In Part 1 of this post, we described how AWS CodeBuild, AWS CodeCommit, and HashiCorp Packer can be used to build an Amazon Machine Image (AMI) from the latest version of Amazon Linux. In this post, we show how to use AWS CodePipeline, AWS CloudFormation, and Amazon CloudWatch Events to continuously ship new AMIs. We use Ansible by Red Hat to harden the OS on the AMIs through a well-known set of security controls outlined by the Center for Internet Security in its CIS Amazon Linux Benchmark.

You’ll find the source code for this post in our GitHub repo.

At the end of this post, we will have the following architecture:

Requirements

 
To follow along, you will need Git and a text editor. Make sure Git is configured to work with AWS CodeCommit, as described in Part 1.

Technologies

 
In addition to the services and products used in Part 1 of this post, we also use these AWS services and third-party software:

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.

Amazon CloudWatch Events enables you to react selectively to events in the cloud and in your applications. Specifically, you can create CloudWatch Events rules that match event patterns, and take actions in response to those patterns.

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. AWS CodePipeline builds, tests, and deploys your code every time there is a code change, based on release process models you define.

Amazon SNS is a fast, flexible, fully managed push notification service that lets you send individual messages or fan out messages to large numbers of recipients. Amazon SNS makes it simple and cost-effective to send push notifications to mobile device users or email recipients. The service can even send messages to other distributed services.

Ansible is a simple IT automation system that handles configuration management, application deployment, cloud provisioning, ad-hoc task-execution, and multinode orchestration.

Getting Started

 
We use CloudFormation to bootstrap the following infrastructure:

  • AWS CodeCommit repository: Git repository where the AMI builder code is stored.
  • S3 bucket: Build artifact repository used by AWS CodePipeline and AWS CodeBuild.
  • AWS CodeBuild project: Executes the AWS CodeBuild instructions contained in the build specification file.
  • AWS CodePipeline pipeline: Orchestrates the AMI build process, triggered by new changes in the AWS CodeCommit repository.
  • SNS topic: Notifies subscribed email addresses when an AMI build is complete.
  • CloudWatch Events rule: Defines how the AMI builder should send a custom event to notify an SNS topic.

The AMI Builder launch template is available for the N. Virginia (us-east-1) and Ireland (eu-west-1) regions.

After launching the CloudFormation template, we will have a pipeline in the AWS CodePipeline console. (A Failed status at this stage simply means we don’t have any data in our newly created AWS CodeCommit Git repository yet.)

Next, we will clone the newly created AWS CodeCommit repository.

If this is your first time connecting to an AWS CodeCommit repository, please see the instructions in our documentation on Setup steps for HTTPS Connections to AWS CodeCommit Repositories.

To clone the AWS CodeCommit repository (console)

  1. From the AWS Management Console, open the AWS CloudFormation console.
  2. Choose the AMI-Builder-Blogpost stack, and then choose the Outputs tab.
  3. Make a note of the Git repository URL.
  4. Use git to clone the repository.

For example: git clone https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/AMI-Builder_repo

To clone the AWS CodeCommit repository (CLI)

# Retrieve CodeCommit repo URL
git_repo=$(aws cloudformation describe-stacks --query 'Stacks[0].Outputs[?OutputKey==`GitRepository`].OutputValue' --output text --stack-name "AMI-Builder-Blogpost")

# Clone repository locally
git clone ${git_repo}

Bootstrap the Repo with the AMI Builder Structure

 
Now that our infrastructure is ready, download all the files and templates required to build the AMI.

Your local Git repo should have the following structure:

.
├── ami_builder_event.json
├── ansible
├── buildspec.yml
├── cloudformation
├── packer_cis.json

Next, push these changes to AWS CodeCommit, and then let AWS CodePipeline orchestrate the creation of the AMI:

git add .
git commit -m "My first AMI"
git push origin master

AWS CodeBuild Implementation Details

 
While we wait for the AMI to be created, let’s see what’s changed in our AWS CodeBuild buildspec.yml file:

...
phases:
  ...
  build:
    commands:
      ...
      - ./packer build -color=false packer_cis.json | tee build.log
  post_build:
    commands:
      - egrep "${AWS_REGION}\:\sami\-" build.log | cut -d' ' -f2 > ami_id.txt
      # Packer doesn't return non-zero status; we must do that if Packer build failed
      - test -s ami_id.txt || exit 1
      - sed -i.bak "s/<<AMI-ID>>/$(cat ami_id.txt)/g" ami_builder_event.json
      - aws events put-events --entries file://ami_builder_event.json
      ...
artifacts:
  files:
    - ami_builder_event.json
    - build.log
  discard-paths: yes

In the build phase, we capture Packer output into a file named build.log. In the post_build phase, we take the following actions:

  1. Look up the AMI ID created by Packer and save its findings to a temporary file (ami_id.txt).
  2. Forcefully make AWS CodeBuild fail if the AMI ID (ami_id.txt) is not found. This is required because Packer doesn’t return a non-zero exit status if something goes wrong during the AMI creation process, so we have to tell AWS CodeBuild to stop by signaling that an error occurred.
  3. If an AMI ID is found, we update the ami_builder_event.json file and then notify CloudWatch Events that the AMI creation process is complete.
  4. CloudWatch Events publishes a message to an SNS topic. Anyone subscribed to the topic will be notified in email that an AMI has been created.

Lastly, the new artifacts phase instructs AWS CodeBuild to upload files built during the build process (ami_builder_event.json and build.log) to the S3 bucket specified in the Outputs section of the CloudFormation template. These artifacts can then be used as an input artifact in any later stage in AWS CodePipeline.

For information about customizing the artifacts sequence of the buildspec.yml, see the Build Specification Reference for AWS CodeBuild.

CloudWatch Events Implementation Details

 
CloudWatch Events allows you to extend the AMI builder beyond sending email after the AMI has been created: you can hook up any of the supported targets to react to the AMI builder event. Publishing this event decouples whatever actions you take after AMI completion from Packer itself, so you can plug in other actions as you see fit.

For more information about targets in CloudWatch Events, see the CloudWatch Events API Reference.

In this case, CloudWatch Events should receive the following event, match it with a rule we created through CloudFormation, and publish a message to SNS so that you can receive an email.

Example CloudWatch custom event

[
        {
            "Source": "com.ami.builder",
            "DetailType": "AmiBuilder",
            "Detail": "{ \"AmiStatus\": \"Created\"}",
            "Resources": [ "ami-12cd5guf" ]
        }
]
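
To test the rule end to end without waiting for a Packer run, you
could hand CloudWatch Events the same kind of event yourself. This
assumes the JSON above is saved as ami_builder_event.json (as it is in
the repository) and that your credentials allow events:PutEvents:

aws events put-events --entries file://ami_builder_event.json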

CloudWatch Events rule

{
  "detail-type": [
    "AmiBuilder"
  ],
  "source": [
    "com.ami.builder"
  ],
  "detail": {
    "AmiStatus": [
      "Created"
    ]
  }
}

Example SNS message sent in email

{
    "version": "0",
    "id": "f8bdede0-b9d7...",
    "detail-type": "AmiBuilder",
    "source": "com.ami.builder",
    "account": "<<aws_account_number>>",
    "time": "2017-04-28T17:56:40Z",
    "region": "eu-west-1",
    "resources": ["ami-112cd5guf "],
    "detail": {
        "AmiStatus": "Created"
    }
}
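
If you would like an additional address to receive these
notifications, you can subscribe it to the same topic from the CLI.
The topic ARN below is a placeholder; take the real one from the
CloudFormation stack outputs:

# Subscribe an extra email address to the AMI builder notifications (placeholder ARN)
aws sns subscribe \
    --topic-arn arn:aws:sns:eu-west-1:123456789012:AmiBuilder-Notify \
    --protocol email \
    --notification-endpoint you@example.com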

Packer Implementation Details

 
In addition to the build specification file, there are differences between the current version of the HashiCorp Packer template (packer_cis.json) and the one used in Part 1.

Variables

  "variables": {
    "vpc": "{{env `BUILD_VPC_ID`}}",
    "subnet": "{{env `BUILD_SUBNET_ID`}}",
         “ami_name”: “Prod-CIS-Latest-AMZN-{{isotime \”02-Jan-06 03_04_05\”}}”
  },
  • ami_name: Prefixes a name used by Packer to tag resources during the Builders sequence.
  • vpc and subnet: Environment variables defined by the CloudFormation stack parameters.

We no longer assume a default VPC is present and instead use the VPC and subnet specified in the CloudFormation parameters. CloudFormation configures the AWS CodeBuild project to use these values as environment variables. They are made available throughout the build process.

That allows for more flexibility should you need to change which VPC and subnet will be used by Packer to launch temporary resources.
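
For example, reproducing the Packer step outside AWS CodeBuild mostly
comes down to exporting those two variables before invoking Packer
(the IDs below are placeholders, and working AWS credentials are
assumed):

# Placeholders; in the pipeline these are injected by CloudFormation as
# AWS CodeBuild environment variables
export BUILD_VPC_ID=vpc-0123456789abcdef0
export BUILD_SUBNET_ID=subnet-0123456789abcdef0
./packer build -color=false packer_cis.json | tee build.log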

Builders

  "builders": [{
    ...
    "ami_name": “{{user `ami_name`| clean_ami_name}}”,
    "tags": {
      "Name": “{{user `ami_name`}}”,
    },
    "run_tags": {
      "Name": “{{user `ami_name`}}",
    },
    "run_volume_tags": {
      "Name": “{{user `ami_name`}}",
    },
    "snapshot_tags": {
      "Name": “{{user `ami_name`}}",
    },
    ...
    "vpc_id": "{{user `vpc` }}",
    "subnet_id": "{{user `subnet` }}"
  }],

We now have new tag properties (*_tags), a new function (clean_ami_name), and we launch temporary resources in the VPC and subnet specified in the environment variables. AMI names can only contain a certain set of ASCII characters. If the input in the project deviates from the expected characters (for example, it includes whitespace or slashes), Packer’s clean_ami_name function will fix it.

For more information, see functions on the HashiCorp Packer website.

Provisioners

  "provisioners": [
    {
        "type": "shell",
        "inline": [
            "sudo pip install ansible"
        ]
    }, 
    {
        "type": "ansible-local",
        "playbook_file": "ansible/playbook.yaml",
        "role_paths": [
            "ansible/roles/common"
        ],
        "playbook_dir": "ansible",
        "galaxy_file": "ansible/requirements.yaml"
    },
    {
      "type": "shell",
      "inline": [
        "rm .ssh/authorized_keys ; sudo rm /root/.ssh/authorized_keys"
      ]
    }
  ]

We used the shell provisioner to apply OS patches in Part 1. Now, we use a shell provisioner to install Ansible on the target machine and the ansible-local provisioner to import, install, and execute Ansible roles that make our target machine conform to our standards.

Finally, Packer uses another shell provisioner to remove the temporary SSH keys before it creates an AMI from the temporary EC2 instance.

Ansible Implementation Details

 
Ansible provides OS patching through a custom Common role that can be easily customized for other tasks.

The CIS Benchmark and CloudWatch Logs are implemented through two third-party Ansible roles that are defined in ansible/requirements.yaml, as referenced in the Packer template.

The Ansible provisioner uses Ansible Galaxy to download these roles onto the target machine and execute them as instructed by ansible/playbook.yaml.

For information about how these components are organized, see the Playbook Roles and Include Statements in the Ansible documentation.

The following Ansible playbook (ansible/playbook.yaml) controls the execution order and custom properties:

---
- hosts: localhost
  connection: local
  gather_facts: true    # gather OS info that is made available for tasks/roles
  become: yes           # majority of CIS tasks require root
  vars:
    # CIS Controls whitepaper:  http://bit.ly/2mGAmUc
    # AWS CIS Whitepaper:       http://bit.ly/2m2Ovrh
    cis_level_1_exclusions:
    # 3.4.2 and 3.4.3 effectively blocks access to all ports to the machine
    ## This can break automation; ignoring it as there are stronger mechanisms than that
      - 3.4.2 
      - 3.4.3
    # CloudWatch Logs will be used instead of Rsyslog/Syslog-ng
    ## Same would be true if any other software doesn't support Rsyslog/Syslog-ng mechanisms
      - 4.2.1.4
      - 4.2.2.4
      - 4.2.2.5
    # Autofs is not installed in newer versions, let's ignore
      - 1.1.19
    # Cloudwatch Logs role configuration
    logs:
      - file: /var/log/messages
        group_name: "system_logs"
  roles:
    - common
    - anthcourtney.cis-amazon-linux
    - dharrisio.aws-cloudwatch-logs-agent

Both third-party Ansible roles can be easily configured through variables (vars). We use Ansible playbook variables to exclude CIS controls that don’t apply to our case and to instruct the CloudWatch Logs agent to stream the /var/log/messages log file to CloudWatch Logs.

If you need to add more OS or application logs, you can easily duplicate the playbook and make changes. The CloudWatch Logs agent will ship configured log messages to CloudWatch Logs.

For more information about the parameters you can use to further customize the third-party roles, see the Ansible roles for the CloudWatch Logs agent and CIS Amazon Linux on the Galaxy website.
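
If you want to inspect or tweak those roles locally before a Packer
run, one way to experiment is to fetch them with Ansible Galaxy and do
a check-mode run (a sketch; the CIS role expects root, so a check-mode
run on a workstation will be noisy and should be done with care):

# Download the third-party roles referenced by the playbook into the local roles path
ansible-galaxy install -r ansible/requirements.yaml -p ansible/roles

# Dry-run the playbook locally without applying changes
ansible-playbook -i localhost, -c local --check ansible/playbook.yaml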

Committing Changes

 
Now that Ansible and CloudWatch Events are configured as part of the build process, committing any changes to the AWS CodeCommit Git repository will trigger a new AMI build process that can be followed through the AWS CodePipeline console.

When the build is complete, an email will be sent to the email address you provided as a part of the CloudFormation stack deployment. The email serves as notification that an AMI has been built and is ready for use.

Summary

 
We used AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, Packer, and Ansible to build a pipeline that continuously builds new, hardened CIS AMIs. We used Amazon SNS so that email addresses subscribed to an SNS topic are notified upon completion of the AMI build.

By treating our AMI creation process as code, we can iterate and track changes over time. In this way, it’s no different from a software development workflow. With that in mind, software patches, OS configuration, and logs that need to be shipped to a central location are only a git commit away.

Next Steps

 
Here are some ideas to extend this AMI builder:

  • Hook up a Lambda function in CloudWatch Events to update an EC2 Auto Scaling configuration upon completion of the AMI build.
  • Use AWS CodePipeline parallel steps to build multiple Packer images.
  • Add a commit ID as a tag for the AMI you created.
  • Create a scheduled Lambda function through CloudWatch Events to clean up old AMIs based on a timestamp (in the name or an additional tag).
  • Implement Windows support for the AMI builder.
  • Create a cross-account or cross-region AMI build.

CloudWatch Events allows the AMI builder to decouple AMI configuration and creation, so that you can easily add your own logic using targets (AWS Lambda, Amazon SQS, Amazon SNS) to react to build events or recycle EC2 instances with the new AMI.

If you have questions or other feedback, feel free to leave it in the comments or contribute to the AMI Builder repo on GitHub.