How to enhance Amazon Macie data discovery capabilities using Amazon Textract

Post Syndicated from ZhiWei Huang original https://aws.amazon.com/blogs/security/how-to-enhance-amazon-macie-data-discovery-capabilities-using-amazon-textract/

Amazon Macie is a managed service that uses machine learning (ML) and deterministic pattern matching to help discover sensitive data that’s stored in Amazon Simple Storage Service (Amazon S3) buckets. Macie can detect sensitive data in many different formats, including commonly used compression and archive formats. However, Macie doesn’t support the discovery of sensitive data within images, audio, video, or other types of multimedia content. Customers often ask how to effectively detect whether there’s sensitive data in images. This can be a significant challenge for organizations, especially those operating in highly regulated industries with strict data protection requirements.

In this post, we show you how to gain visibility of sensitive data embedded in images that are stored within your S3 buckets by adding an additional conversion layer to extract image-based data into a format supported by Macie. The solution also uses the recommended set of managed identifiers and custom data identifiers supported by Macie to cover most use cases.

Solution overview

In this section, we walk through the components of the solution. The solution is deployed using AWS Serverless Application Model (AWS SAM), which is an open source framework for building serverless applications. AWS SAM helps you organize related components and operate on them as a single stack. When used together with the AWS SAM CLI, it's a useful tool for developing, testing, and building serverless applications on AWS. We provide an AWS SAM template that you can use to set up the required services and AWS Lambda functions. Figure 1 illustrates the architecture of the solution.

Figure 1: Solution architecture and workflow

The solution workflow is as follows:

  1. A user uploads images that might contain sensitive data into the S3 bucket.
  2. After you have verified that potentially sensitive data has been uploaded into the S3 bucket, you can manually invoke the Lambda function textract-trigger to start the process. This function calls Amazon Textract asynchronously to process files in the S3 bucket with filename extensions such as .png, .jpg, and .jpeg. Amazon Textract creates a job for each image and extracts the text found in each image.
  3. Because the operation is asynchronous, the job ID and status of each call are stored in an Amazon DynamoDB table. The table tracks the status of the jobs and makes sure that all of them are completed before Macie is triggered to scan the S3 bucket.
  4. The resulting JSON file from the Amazon Textract job is stored within the same S3 bucket as the original image.
  5. For each analysis job, Amazon Textract sends a job completion notification to the registered Amazon Simple Notification Service (Amazon SNS) topic AmazonTextractJobSNSTopic. The Lambda function macie-trigger is subscribed to the SNS topic and triggered every time an SNS message is received for a completed Amazon Textract analysis job.
  6. Further post-processing is done by the Lambda function macie-trigger to extract the values from the JSON file into a text file. This text file is then uploaded into the same S3 bucket.
  7. The function then checks for other in-progress Amazon Textract jobs in the DynamoDB table. If there are pending jobs, the function exits and waits to be triggered again.
  8. After all of the Amazon Textract jobs are marked as complete in the DynamoDB table, the Lambda function macie-trigger creates a Macie classification job.
  9. Macie then scans the bucket for sensitive data based on managed identifiers and your custom data identifiers.
  10. Macie will continuously publish the classification job status to Amazon CloudWatch Logs.
  11. It might take some time to scan all the files in the S3 bucket, and you will be notified through SNS email when the Macie job is completed. The Lambda function MacieCompletedSNSLambda will filter for completed job status and send an email notification using the SNS topic MacieSnsTopic.
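The file filtering, text extraction, and job tracking in steps 2 through 8 can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped with the solution: the function names are hypothetical, the case-insensitive extension match is an assumption, and the job status is assumed to be stored using the Amazon Textract status value IN_PROGRESS. The only detail taken from the Amazon Textract response format is that extracted text lines appear as entries in Blocks with a BlockType of LINE and a Text field.

```python
from typing import Dict

# Extensions named in step 2; matching case-insensitively is an assumption.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def is_supported_image(key: str) -> bool:
    """Return True if the S3 object key looks like an image to send to Textract."""
    return key.lower().endswith(IMAGE_EXTENSIONS)


def extract_lines(textract_result: Dict) -> str:
    """Join the text of every LINE block in a Textract result into plain text."""
    lines = [
        block["Text"]
        for block in textract_result.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)


def all_jobs_complete(job_statuses: Dict[str, str]) -> bool:
    """Return True when no tracked job is still in progress.

    job_statuses maps a job ID to its status, standing in for the
    DynamoDB table (hypothetical layout).
    """
    return all(status != "IN_PROGRESS" for status in job_statuses.values())
```

In this sketch, a function like macie-trigger would write the output of extract_lines to a text file in the bucket and create the Macie job only when all_jobs_complete returns True.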

When deploying the solution, you can specify an existing S3 bucket in your AWS account that’s already storing data that might be sensitive or deploy a new S3 bucket as part of the setup. If you specify an existing S3 bucket, make sure that there are no additional statements in the bucket policy or KMS key policy that will deny the relevant solution components access to the S3 bucket. If no existing S3 bucket is specified, a new S3 bucket will be created with the name s3-with-sensitive-data-<account-id>-<random-string>.

Prerequisites

Before deploying the solution, make sure the following prerequisites are in place.

  • Enable Macie in your account. For instructions, see Getting Started with Amazon Macie.
  • Determine the regular expression (regex) pattern for sensitive textual data that you want Macie to detect. This will allow you to create custom data identifiers that complement the managed data identifiers provided by Macie. For more information, see Building custom data identifiers in Amazon Macie. There’s an example in the pre-deployment steps that you can follow, with the sample images that come with the solution.
  • Make sure that you have the permissions to deploy the AWS services detailed in the solution: Lambda, Amazon S3, Amazon Textract, Amazon SNS, Macie, Amazon CloudWatch, and DynamoDB.
  • Install the AWS SAM CLI, which you will use to deploy the solution. To learn more about how AWS SAM works, see The AWS SAM project and AWS SAM template.

Pre-deployment steps

With the prerequisites in place, you need to set up one or more custom identifiers through the AWS Management Console for Macie before you can deploy the solution. Use the following steps to set up an example custom identifier for the images provided in this post.

To set up custom identifiers:

  1. Navigate to the Amazon Macie console.
  2. Choose Custom data identifiers in the navigation pane, and then choose Create.
  3. Enter a name and description for the custom identifier, such as the following examples:
    1. Name: Singapore NRIC Number
    2. Description: This expression can be used to find or validate a Singapore NRIC Number that begins with the character S, F, T, or G, followed by seven digits and ending with any character from A to Z.
  4. For Regular expression, enter: [SFTG]\d{7}[A-Z].
  5. For Keywords, enter: Singapore, Identity, Card.
    Keywords are important because they can help to improve the accuracy of the detection and refine the results.
  6. Leave the other fields as default and choose Submit.
  7. Navigate to the newly created custom identifier and note the ID. This ID is required as an input when deploying the AWS SAM solution.
    Figure 2: ID of a newly created Macie custom identifier
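Before creating the custom identifier, it can help to try the regular expression locally. The following is a small sketch that uses Python's re module as a stand-in; Macie evaluates the expression with its own regex engine, so treat this only as a quick sanity check. The function name and the sample values are made up for illustration.

```python
import re
from typing import List

# The expression entered in the Regular expression field above.
NRIC_PATTERN = re.compile(r"[SFTG]\d{7}[A-Z]")


def find_nric_candidates(text: str) -> List[str]:
    """Return every substring of text that matches the NRIC format."""
    return NRIC_PATTERN.findall(text)


if __name__ == "__main__":
    # Fictitious value used only to exercise the pattern.
    print(find_nric_candidates("Singapore Identity Card S1234567D"))
```

Note that the pattern checks the format only: a leading S, F, T, or G, seven digits, and a trailing uppercase letter. A string such as A1234567D or S123456D produces no match.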

Deploy the solution

With the prerequisites in place and pre-deployment steps complete, you’re ready to deploy the solution.

To deploy the solution:

  1. Open a CLI window, navigate to your preferred local directory, and run git clone https://github.com/aws-samples/enhancing-macie-with-textract.
  2. Navigate to this directory by using cd enhancing-macie-with-textract.
  3. Run sam deploy --guided and follow the step-by-step instructions to indicate the deployment details such as the desired CloudFormation stack name, AWS Region, and other details. The following are descriptions of some of the requested parameters:
    • ExistingS3BucketName: This is the name of the S3 bucket that you want the solution to scan. This is an optional parameter. If it's left blank, the solution will create an S3 bucket for you to store the objects that you want to scan.
    • MacieCustomCustomIdentifierIDList: This is the ID that you noted in the final pre-deployment step. Use this field to enter the list of custom identifiers that you want Macie to use for detection. If there is more than one ID, separate the IDs with commas (for example, 59fd2814-0ba8-41cc-adb2-1ffec6a0bb3c, 665cf948-ea30-42df-9f63-9a858cbfe1a8).
    • EmailAddress: This is the email address that you want Amazon SNS email notifications to be sent to when a Macie job is complete.
    • MacieLogGroupExists: This indicates whether you have an existing Macie CloudWatch Log Group (/aws/macie/classificationjobs). If this is your first time running a Macie job, enter No or n. Otherwise, enter Yes or y.
  4. When completed, a confirmation request will be presented for the creation of the required resources. AWS SAM creates a default S3 bucket to store the necessary resources and then proceeds to the deployment prompt. Enter y to deploy and wait for deployment to complete.
  5. After deployment is complete, you should see the following output: Successfully created/updated stack – {StackName} in {AWSRegion}. You can review the resources and stack in the CloudFormation console.
    Figure 3: CloudFormation console of the deployed stack

  6. An email will be sent by Amazon SNS to the email address that you entered in step 3. Choose Confirm subscription to allow SNS to send you Macie job completion emails.
    Figure 4: Sample email from Amazon SNS for subscription confirmation
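To show how the comma-separated list of custom identifier IDs might be used, here is a hedged Python sketch that parses the list and builds the parameters for a one-time Macie classification job. The function names, account ID, bucket name, and job name are hypothetical, and the solution's own Lambda code may do this differently. The dictionary keys follow the request shape of the create_classification_job operation on the boto3 macie2 client.

```python
from typing import Dict, List


def parse_identifier_ids(raw: str) -> List[str]:
    """Split a comma-separated ID list, dropping whitespace and empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_job_request(account_id: str, bucket_name: str,
                      raw_ids: str, job_name: str) -> Dict:
    """Build keyword arguments for a one-time Macie classification job.

    Uses the recommended set of managed identifiers plus the custom ones.
    """
    return {
        "jobType": "ONE_TIME",
        "name": job_name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
        "customDataIdentifierIds": parse_identifier_ids(raw_ids),
        "managedDataIdentifierSelector": "RECOMMENDED",
    }
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("macie2").create_classification_job(**request). Stripping whitespace in parse_identifier_ids means a list typed with spaces after the commas is handled the same way as one typed without them.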

Test the solution

With the solution deployed, use a set of sample images to verify that it can detect sensitive data within images.

To test the solution:

  1. Use the Amazon S3 console to navigate to the bucket you specified during deployment. If you didn’t specify an S3 bucket to scan, look for a new bucket named s3-with-sensitive-data-<account-id>-<random-string>.
  2. In your project directory, there are sample images in sample-images.zip. Unzip the file and upload the sample images into the S3 bucket. The sample images include a US driver’s license, social security card, passport, and a Singapore National Registration Identity Card (NRIC).
  3. Navigate to the AWS Lambda console and select the {StackName}-TextractTriggerLambda-<random-string> function.
  4. Choose the Test tab and then choose Test to start the automated sensitive data discovery process for the uploaded images.
    Figure 5: Trigger an Amazon Textract scan on all images in the S3 bucket

  5. The whole process will take about 15 minutes to complete. You will receive an email notification after the Macie scan is completed.
    Figure 6: Sample email from Amazon SNS for Macie job completion

  6. Navigate to the Amazon Macie console and select Jobs in the navigation pane. You should see the job Scan for [number of] objects [datetime stamp] that matches the job name shown in the email notification.
  7. In the details panel, choose Show results and then choose Show findings.
    Figure 7: Show Macie data discovery job findings

  8. You will see the findings related to the Macie sensitive data discovery job ID that you selected.
    Figure 8: Findings from the data discovery job

Understanding the findings

In this section, we take a closer look at each finding.

  1. In the console, look in the Resources affected column for the finding that ends with singapore-pink-nric-postprocessed.txt and select it. The finding type SensitiveData:S3Object/CustomIdentifier means that the resource contains text that matches the detection criteria of a custom data identifier. The other finding types in this example are from managed data identifiers. See Types of sensitive data findings for more information about Macie finding types.
  2. In the finding information panel, you can also see:
    1. In the Overview section, the resource indicates which resource contains sensitive data. The resource identifies the text file; however, you can identify the original image file because it has the same object name (other than the file type).
    2. In the Custom data identifiers section, you can see the type of sensitive data found. In this case, the finding involves data that matches the regex of a Singapore NRIC.

By using this solution, you can use Macie to detect sensitive data within the images in your S3 bucket and identify which image each finding corresponds to.
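If you need to trace many findings back to their source images, the mapping can be automated. The sketch below is an assumption-heavy illustration: it infers the -postprocessed.txt suffix from the example finding name above and assumes the original image shares the rest of the object name with one of the image extensions the solution processes. The function name is hypothetical.

```python
from typing import List

# Suffix inferred from the example finding singapore-pink-nric-postprocessed.txt.
POSTPROCESSED_SUFFIX = "-postprocessed.txt"
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def candidate_image_keys(finding_key: str) -> List[str]:
    """Return possible original image keys for a post-processed text file key.

    Returns an empty list if the key does not follow the assumed naming.
    """
    if not finding_key.endswith(POSTPROCESSED_SUFFIX):
        return []
    stem = finding_key[: -len(POSTPROCESSED_SUFFIX)]
    return [stem + extension for extension in IMAGE_EXTENSIONS]
```

You would then check which of the candidate keys exists in the bucket to find the image that produced the finding.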

Using the solution

In this post, you have configured a single custom data identifier and a recommended set of managed identifiers. However, you can create and use multiple custom data identifiers in the solution by providing them as a comma-separated list, as mentioned in step 3 of Deploy the solution.

This solution has been designed to enable sensitive data discovery of text in image objects within a single S3 bucket. To expand the scope to include multiple S3 buckets, some additional code and permission changes are required to allow the Lambda functions to process and access multiple existing S3 buckets.

It’s important to note the language capabilities of Amazon Textract. Amazon Textract can extract printed text and handwriting from the standard English alphabet and ASCII symbols. Currently, Amazon Textract supports extraction in English, German, French, Italian, and Portuguese. For more information on what textual information Amazon Textract can identify, see Amazon Textract FAQs.

Clean up the resources

To clean up the resources that you created for this example:

  1. If you didn’t set up the scan on your own S3 bucket, empty the S3 bucket that was created as part of the solution. Open the Amazon S3 console, search for the bucket name s3-with-sensitive-data-<account-id>-<random-string> and choose Empty.
  2. Use one of the following methods to delete the CloudFormation stack:
    • Use the CloudFormation console to delete the stack.
    • Use AWS SAM CLI to run sam delete in your terminal. Follow the instructions and enter y when prompted to delete the stack.

Conclusion

In this post, you learned how to enhance the capabilities of Amazon Macie to conduct sensitive data discovery within image files. With this solution, you can extend the benefits of Amazon Macie beyond structured file formats.

If you want to extend the benefits of Amazon Macie to scan your databases for sensitive data, you might find these blog posts useful:

If you have feedback about this post, submit comments in the Comments section. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

ZhiWei Huang

ZhiWei is a Financial Services Solutions Architect at AWS. He works with FSI customers across the ASEAN region, providing guidance for establishing robust security controls and networking foundations as customers build on and scale with AWS. Outside of work, he finds joy in travelling the world and spending quality time with his family.

Edmund Yeo

Edmund is a Security Solutions Architect based in Singapore. He works with customers across the ASEAN region, helping them to continually raise their security bar and posture so that they can innovate rapidly and securely. Outside of work, Edmund enjoys having a cup of coffee, seeing the world, and bouldering.

Ying Ting Ng

Ying Ting is an Associate Security Solutions Architect at AWS, where she supports ASEAN growth customers in scaling securely on the cloud. She also advises on architectural best practices to help customers meet industry compliance standards. An active member in Amazon Women in Security, Ying Ting shares insights on making an impact as an early-career cybersecurity professional.

Out With the Old, In With the New: Securely Disposing of Smart Devices

Post Syndicated from Deral Heiland original https://blog.rapid7.com/2025/01/06/out-with-the-old-in-with-the-new-securely-disposing-of-smart-devices/

So, what did you get for Christmas this year?

Hopefully you received some cool smart technology, or maybe you just upgraded your smart camera or voice assistant to a newer model or version. If you upgraded to a new model or version, what is your plan for the old device? Is it still working or is it broken?

Either way, you will need to figure out what to do with it: Donate it, sell it online, or maybe dispose of it as electronic waste. Before you make up your mind, let’s think through a few things.

Have you done a factory reset?

The key reason to do a factory reset is to make sure the device is no longer customized to your environment and that personal information, such as WiFi passwords, email addresses, usernames and account passwords, and your name and home address, is properly removed from the device before it leaves your hands.

Factory resets are accomplished in different ways, depending upon the device. For example, some devices have a button you press and hold, while others may use a mobile or web application to trigger the reset. I have also seen devices where you just cycle the power multiple times in sequence to reset the devices. Regardless of what the manufacturer’s recommended process is, it is very important that you follow it.

But what if the device appears to be broken? Well, if your old device truly is 100% dead, then there’s not much you can do about that. But the truth is, an IoT device may not be completely functioning yet it may still allow a factory reset to be done. For example, if the device appears to power up but doesn’t communicate correctly, you can try pushing and holding the reset button and see if the device resets (often indicated by the lights).

Now, let’s say the device has a web application online or a mobile application and it shows the device is online — such as a smart light bulb or an Amazon Echo — but you cannot get it to work correctly. For example, the LED lights on a smart bulb don’t light up or the Echo doesn’t respond to your voice. If the application is still showing the device is online, then chances are the network communication is still working, and the application may allow you to remote reset the device.

Again, if you can find a way to accomplish a factory reset, then you should do it.

What could go wrong?

What could happen if you don’t properly reset a device and then dispose of it by selling it or giving it to someone else, who may in turn sell the device?

Out of curiosity and an attempt to answer this question, I purchased a box full of previously owned Amazon Echos online. Many of them were supposed to be broken, with most of them marked “dead speakers.” With that said, I had an important question to answer. Did the owner use the Amazon Alexa online application to factory reset and remove the Echo device before selling or giving the device to the person who sold it to me? I proceeded to disassemble 10 of these devices and dumped their memory so that I could evaluate the results to see if any of them were still provisioned and contained any user data.

Out of the 10 devices I examined, 4 were found to still be provisioned. As a small example of the potential data accessible on these 4 devices, I conducted further examination and found that they still contained the WiFi SSID and Pre-shared Key (PSK) for the users' home networks. Having the PSK in hand can give a malicious actor access to the user's home network.

To make matters worse, on one of the 4 provisioned devices, the user had used his last name for the SSID and his home phone number for the PSK. Using personally identifiable information such as a name or phone number in an SSID makes it much easier to trace a device back to a specific person and physical location. In other words, I highly recommend you not do this.

In the case of the Amazon Echo specifically, critical data such as personal Amazon authentication account information is currently stored in encrypted storage on the devices; therefore, it would take more work for someone to gain access to that, but I would not say it is impossible. Also, it’s important to note that although Echo devices may be encrypting the user account information they store, not all smart products on the market follow those recommendations. So — once again — to reduce the risk of your data being compromised, it is important that you factory reset your devices prior to disposal.

The proliferation of consumer-grade IoT devices in business.

It’s important to point out that the issue I’ve been discussing here doesn’t just apply to general consumers but also to businesses. Often, we assume that consumer-grade IoT technologies are only used by home users when in fact businesses of all sizes can and do leverage consumer-based IoT technology in the workplace environment.

It’s common to see WiFi access points, smart TVs, smart cameras, TV streaming boxes, smart exercise equipment, consumer-grade printers, and yes, even smart voice assistants, being used within a number of organizations. For example, every year I build out exercises for DEF CON IoT Village to help expose people to, and train them on, various aspects of hardware hacking. I purchase devices on the secondary market for this training and every year at least 40-50% of the devices I’ve purchased have not been factory reset. On more than one occasion, I have even purchased blocks of devices from a single reseller to find that those devices were not factory reset and were still configured with data from an operational business.

Start the year off securely.

To summarize, here are the key takeaways from my experiment and the bigger conversation around disposing of old smart devices:

  1. Business IT and security leaders should have clear, cradle-to-grave processes governing the IoT technologies purchased, so that the organization is not exposed to unnecessary risk. Cradle to grave covers initial installation and provisioning, ongoing maintenance, and, in the end, safe and secure disposal of the technology.
  2. Consumers should make sure to do a proper factory reset prior to the disposal or resale of any smart devices. Keep in mind that even if the device appears to be broken, a factory reset is still often possible — see my guidance earlier in this blog.
  3. If you cannot factory reset your device and you’re concerned about the data on the device, you can always change your SSID, PSK, and account password.

Finally, always remember to never dispose of your IoT technology in the trash, as landfills are not the proper place to send them. Instead, these devices should be sent to an electronic waste disposal option in your local area.

If you are looking for a deeper dive, check out this research paper on Amazon Echo Dot devices from Northeastern University. Happy New Year!

Kernel prepatch 6.13-rc6

Post Syndicated from corbet original https://lwn.net/Articles/1004230/

Linus has released 6.13-rc6 for testing.

So we had a slight pickup in commits this last week, but as
expected and hoped for, things were still pretty quiet. About twice
as many commits as the holiday week, but that’s still not all that
many.

I expect things will start becoming more normal now that people are
back from the holidays and are starting to recover and wake up from
their food comas.

Beelink GTi12 Ultra Mini PC Review with GPU Dock Expansion

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/beelink-gti12-ultra-mini-pc-review-intel-amd-nvidia-gpu-dock-expansion/

In our Beelink GTi12 Ultra review, we see how this system, with lots of expandability and even an option for an external GPU dock, performs.

The post Beelink GTi12 Ultra Mini PC Review with GPU Dock Expansion appeared first on ServeTheHome.

The cadastre has released open data on everything in Bulgaria, and it is incredible

Post Syndicated from Боян Юруков original https://yurukov.net/blog/2025/kais-opendata/

At the end of last year, I wrote briefly on social media that the cadastre's new portal is public and, at first glance, has a rather useful application: watching for illegal additions to buildings. In fact, this new version brought something we had been pushing for and waiting on for many years: open data on properties in Bulgaria.

The information we were looking for is the exact outlines of provinces, municipalities, land areas, and individual parcels. This information has many applications beyond working with real estate. Eleven years ago, for example, I downloaded a large part of this data piece by piece and consolidated it into a simplified version of the map of municipalities and settlements in Bulgaria. I released it freely for data visualization purposes and used it myself in my charts on logging, elections, and a number of other topics. Similarly, for my map of urban planning documents in Sofia, I have downloaded nearly 25% of all parcels in the capital over the past four years. The situation is similar with the 3D map of building development. The data has been used in at least 10 of my projects and visualizations so far.

The village of Apriltsi in Pazardzhik

So I was glad to discover that the cadastre's new portal has an open data section where all of the information mentioned can be downloaded as archives: individual parcels, buildings, and self-contained units in SHP format. I have already downloaded the data for Sofia municipality and will no longer need to load the servers of NAG and the cadastre every time a new document comes out. The data contains a lot of metadata, such as area, number of floors, exact address, ownership, district, type of use, and the document that determines the latter. Even the address information alone is invaluable, since until now there was no such public database in Bulgaria. There are even the addresses of self-contained units (garages, apartments, and other parts of buildings), including exactly where they are located and what their (legal) area is.

Metadata for a random building in Sofia

That is far from all, however. Information on the ownership of these properties has been published: whether it is full or partial; whether the owner is an individual, a legal entity, a municipality, or the state; and with what documents and when that ownership was established. There are even the EIK numbers and names of companies. When the owner is a private individual, the name is masked and the EGN is encrypted so that it cannot be recovered but remains unique and can be matched against other records. This is an invaluable database of now-public information which, without exaggeration, will open a new chapter in investigations of abuses by private and public figures.

Example of parcel ownership in the village of Apriltsi, Pazardzhik

The data in its current form was generated on December 14. I hope they will be able to update it regularly, especially the ownership data. Some information is still missing: there is no data for 6 municipalities, including Varna municipality. Within those, as well as in other municipalities, information is missing for about 380 settlements, or 7.2%. I also sent other feedback to the portal's creators and understand that work is under way to fill in all of the information. Two main points were bilingual documentation and nomenclatures, as well as a way to download everything at once.

Even at this point, however, it is impressive. I would compare it only to the publication of open data from the commercial register, with the difference that this data is much clearer, better organized, and ready for use. I showed the data to several acquaintances who are experts working with GIS systems and collecting information from similar registers around the world. One in particular writes to me every few months to ask whether we at least have the outlines of the parcels in the cities. All of them were amazed by the quality and completeness of the information and metadata, including in comparison with analogous sources in Germany, the United Kingdom, and the United States.

The center of Sofia in buildings and parcels

I managed to download all of the data automatically and am already going through it in detail, especially the ownership data. I still have no idea which of this data I will show or how, but there is definitely a lot that can be done with it. I do know, however, that I will use the parcels in my visualization of the logging data that I opened up recently. This will be very useful for startups in particular, beyond pure GIS systems, real estate, and planning. If you have ideas for how you would use the data, or have already made something with it, share it in the comments.

The post The cadastre has released open data on everything in Bulgaria, and it is incredible first appeared on Блогът на Юруков.
