As a lead backend engineer and an avid finance enthusiast, my journey has blended advanced programming with machine learning insights in the finance sector. I started with Python, then moved to JavaScript and Go for their respective strengths, but eventually found even them stretched by the demanding pace of my tasks. That led me to the efficiency of jq, GNU Parallel, and a file-based task queue. This powerful trio not only streamlined my stock market projects but also proved invaluable for parsing application logs and transforming data across various domains. Their impact was so profound that I've since leaned on them for heavy-duty jobs like pushing millions of events through SQS in short order. I'm excited to share how these tools have simplified my approach beyond traditional programming.
What is jq?
jq is like a Swiss Army knife for JSON data. It’s a lightweight and flexible command-line JSON processor. You can slice, dice, transform, and reshape JSON data in whichever way you fancy. The best part? Its syntax is intuitive and powerful, perfect for quick one-liners or complex transformations.
Let's take an example JSON that I built (with ChatGPT's help) for better understanding.
{
  "products": [
    {"id": 1, "name": "Laptop", "price": 800, "inStock": true},
    {"id": 2, "name": "Smartphone", "price": 500, "inStock": false},
    {"id": 3, "name": "Tablet", "price": 300, "inStock": true},
    {"id": 4, "name": "Smartwatch", "price": 200, "inStock": true},
    {"id": 5, "name": "Camera", "price": 150, "inStock": false}
  ]
}
Filtering by Price and Stock Availability
- Task: Select products priced over $300 and currently in stock.
- Command:
jq '.products[] | select(.price > 300 and .inStock == true)' products.json
- Output:
{ "id": 1, "name": "Laptop", "price": 800, "inStock": true }
Calculating Average Price
- Task: Compute the average price of all products.
- Command:
jq '[.products[].price] | add / length' products.json
- Output:
390
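These pieces compose nicely. For instance, averaging only the in-stock products is just one more pipe (a quick variation on the two examples above, same sample data):
jq '[.products[] | select(.inStock) | .price] | add / length' products.json
which comes out to roughly 433.33 for this dataset.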
Combining and Deduplicating Data from Multiple Files
- Additional File (additional_products.json): [{"id": 6, "name": "Headphones", "price": 100, "inStock": true}, {"id": 1, "name": "Laptop", "price": 800, "inStock": true}]
- Task: Merge product data from two files, removing duplicates.
- Command:
jq -s '.[0].products + .[1] | unique_by(.id)' products.json additional_products.json
With -s (slurp), jq reads both files into a single array, so .[0].products is the product list from the first file and .[1] is the second file's bare array; concatenating them and applying unique_by(.id) drops the duplicate Laptop entry.
- Output:
[
  {"id": 1, "name": "Laptop", "price": 800, "inStock": true},
  {"id": 2, "name": "Smartphone", "price": 500, "inStock": false},
  {"id": 3, "name": "Tablet", "price": 300, "inStock": true},
  {"id": 4, "name": "Smartwatch", "price": 200, "inStock": true},
  {"id": 5, "name": "Camera", "price": 150, "inStock": false},
  {"id": 6, "name": "Headphones", "price": 100, "inStock": true}
]
Converting JSON to CSV Format
- Command:
jq -r '.products[] | [.id, .name, .price] | @csv' products.json
- Output:
1,"Laptop",800
2,"Smartphone",500
3,"Tablet",300
4,"Smartwatch",200
5,"Camera",150
As you can see, we can do the kind of work we used to reach for pandas or JavaScript for. Imagine running these calculations on documents exported from MongoDB, or directly on an Elasticsearch query response (Elasticsearch has REST endpoints, unlike some other stores, so its responses are already JSON).
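For instance, here is a rough sketch of computing an average price straight from an Elasticsearch search response. The host and index name are made up for illustration, but .hits.hits[]._source is the standard shape of a search response:
curl -s 'http://localhost:9200/products/_search?size=100' | jq '[.hits.hits[]._source.price] | add / length'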
Now comes GNU Parallel.
What is GNU Parallel?
GNU Parallel is a tool for executing jobs in parallel, using one or more computers. It’s like a magic wand that takes a bunch of tasks and distributes them across available CPU cores, speeding up processing significantly. It’s particularly useful for repetitive tasks that can be run concurrently.
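A minimal illustration before the real tasks (the .log files here are just placeholders):
# Compress every .log file in the current directory, running one gzip per CPU core by default
parallel gzip ::: *.log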
Task: Run a complex command involving pipes and redirections.
Command: parallel "grep -E '^[^2][0-9]{2} ' {} | wc -l > {}.count" ::: access_log*.txt
UseCase: I heavily use this when searching through logs, specifically when shit hits the fan. #funtimes
Task: You want to run a set number of jobs in parallel from a larger list of tasks.
Command: cat urls.txt | parallel -j 10 curl -O
UseCase: This downloads files from the URLs listed in urls.txt, running 10 downloads in parallel at any time. (Torrents, I guess, college days.) These days I use it for data-scraping POCs. #fuckMyLife
Task: Execute tasks in parallel but keep the output in the same order as input.
Command: parallel --keep-order echo ::: $(seq 1 5)
UseCase: because sometimes you need order in your life.
Task: Suppose you have a script process.sh that takes two arguments, and you want to run it for different combinations of arguments in parallel.
Command: parallel ./process.sh ::: arg1 arg2 arg3 ::: argA argB
This command will execute ./process.sh with all combinations of the arguments, resulting in six executions:
- ./process.sh arg1 argA
- ./process.sh arg1 argB
- ./process.sh arg2 argA
- ./process.sh arg2 argB
- ./process.sh arg3 argA
- ./process.sh arg3 argB
UseCase: I used this to test our system with different settings and compare the outcomes. We suspected the database might be caching data, so we gave it a shot. Surprisingly, running Elasticsearch and Spring Boot on a modest 2 GB EC2 instance, we hit an average of 300+ requests per second with 30 concurrent requests. That convinced us to embrace Docker for Elasticsearch and our apps, giving us more control over the hardware and the ability to max out CPU usage. #goodtimes
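If you want to throw a similar quick-and-dirty load at an endpoint, Parallel plus curl gets you surprisingly far. The URL and counts below are placeholders:
# 1000 requests total, 30 in flight at a time, printing each response time
seq 1 1000 | parallel -j 30 -N0 "curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8080/api/products"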
How to Emulate File-Based Queue Logic for Resuming Processing?
The brilliance of a file-based task queue lies in its simplicity and dependability. Here’s a quick guide on how to set it up:
- Create a Task File: This file lists all the tasks to be processed, one per line, like a to-do list.
- Process with GNU Parallel: Run GNU Parallel, directing it to your task file. It picks up each task and executes it.
- Logging Progress: With --joblog, GNU Parallel records every finished job in a log file. If the process halts, rerun the same command with --resume and Parallel continues from where it stopped, skipping completed tasks.
- Easy Resumption: This setup ensures that interruptions don’t require starting from scratch. Simply run the command, and it continues processing the remaining tasks.
# Create a tasks file
printf "task1\ntask2\ntask3\n" > tasks.txt
# Process with GNU Parallel; --joblog records progress and --resume lets an interrupted run pick up where it left off
parallel --joblog progress.log --resume < tasks.txt
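If the run dies halfway, just execute the same parallel line again: --resume checks progress.log and only runs the tasks that haven't completed successfully yet. (The task1/task2/task3 entries are placeholders; in practice each line of tasks.txt is a complete shell command.)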
Here's an example of web scraping that combines jq, GNU Parallel, and a file-based queue, using the DummyJSON website (example put together with ChatGPT). Keep in mind that you can also use these tools individually for different purposes.
#!/bin/bash
# Create a task file with URLs
echo "https://dummyjson.com/products/1" > tasks.txt
echo "https://dummyjson.com/products/2" >> tasks.txt
echo "https://dummyjson.com/products/3" >> tasks.txt
echo "https://dummyjson.com/products/4" >> tasks.txt
echo "https://dummyjson.com/products/5" >> tasks.txt
# Create a queue file to store failed tasks
touch queue.txt
# Create a CSV file with headers
echo "id,title,description,price,discountPercentage,rating,stock,brand,category" > output.csv
# Function to process a task and append to CSV
function process_task() {
  local url="$1"
  # Fetch the JSON response and flatten the fields we care about into a properly quoted CSV row
  local csv_data
  csv_data=$(curl -s "$url" | jq -r '[.id, .title, .description, .price, .discountPercentage, .rating, .stock, .brand, .category] | @csv')
  # Check the HTTP status code
  local http_status
  http_status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  if [ "$http_status" -eq 200 ]; then
    # Append the CSV row to output.csv
    echo "$csv_data" >> output.csv
    # Remove the processed task from tasks.txt
    # (rewriting tasks.txt from parallel jobs can race; acceptable for a small demo)
    grep -v "$url" tasks.txt > tasks_tmp.txt
    mv tasks_tmp.txt tasks.txt
  else
    echo "Failed: $url (HTTP Status: $http_status)"
    # Append the failed task to the queue file for retry
    echo "$url" >> queue.txt
  fi
}
# Export the function so it can be used with parallel
export -f process_task
# Process tasks with GNU Parallel
cat tasks.txt | parallel -j10 process_task
# Clean up JSON files
rm -f *.json
# Verify that tasks.txt and queue.txt are empty, and remove them if so
if [ ! -s tasks.txt ]; then
  rm tasks.txt
fi
if [ ! -s queue.txt ]; then
  rm queue.txt
fi
echo "Processing completed."
I’m pondering whether to write about how I turned a stock prediction challenge into a machine learning adventure or perhaps dive into the showdown between Java virtual threads and Spring WebFlux. Well, that’s the plan for now. Until then, jai ram ji ki.