Enterprise Batch Processing with Jakarta Batch - Part 1


Batch processing is an integral part of enterprise applications. Reading, processing and storing vast amounts of data is best suited to batch processing runtimes optimized for such workloads. Inventory processing, payroll, report generation, invoice and statement generation, data migration and data conversion, among others, are all tasks well suited to batch processing.

Batch processing typically involves breaking the data to be processed into smaller "chunks," which are in turn broken down into even smaller units for processing. The processing is then carried out on a single unit of data at a time, without any human intervention. This makes processing significantly large amounts of data efficient and fast. Batch processing can also be parallelized to take advantage of the hardware capabilities of the underlying computer.

This blog series will show you how to create batch processing tasks on the Jakarta EE Platform. It is broken down into a series of posts, each covering a specific part of the batch specification. This first post introduces you to the Jakarta Batch specification and gives a high-level overview of what constitutes a batch task.



Jakarta Batch

On the Jakarta EE platform, Jakarta Batch is the standard batch processing specification for creating simple to sophisticated batch processing tasks. To create a batch job, you should first understand how Jakarta Batch is structured. Let's start by looking at the various components.

Batch Job

A batch job in Jakarta Batch is an encompassing instruction that contains everything needed to run a given batch task. For instance, a bank may have the task of sending client statements at the end of every week. This task can be encapsulated, or described, in a Jakarta Batch job. The job will have a name - e.g. MailClientStatements - the steps needed to get the job done, whether certain exceptions should be skipped, any listeners, job-level properties and anything else the job needs to run. Remember, a typical batch job runs end-to-end with no human intervention, so a job describes in advance everything the given batch task will need.

A job is described in XML using the Jakarta Batch Job Specification Language (JSL). The MailClientStatements job can be described in the following XML file.

<?xml version="1.0" encoding="UTF-8"?>
<job id="mail-client-statements" xmlns="https://jakarta.ee/xml/ns/jakartaee"
     version="2.0">
    <properties>
        <property name="propertyName" value="propertyValue"/>
        <property name="anotherProperty" value="anotherValue"/>
        <property name="defaultFileType" value="pdf"/>
    </properties>
    <listeners>
        <listener ref="jobLevelListener"></listener>
    </listeners>
    
    <step id="computeBalance" next="createFile">
        <properties>
            <property name="firstStepProperty" value="firstStepValue"/>
        </properties>
        <listeners>
            <listener ref="firstStepListener"></listener>
        </listeners>
        <chunk>
            <reader ref="myReader"></reader>
            <processor ref="myProcessor"></processor>
            <writer ref="myWriter"></writer>
        </chunk>
    </step>
    
    <step id="createFile" next="mailStatement">
       
        <batchlet ref="myBatchlet"></batchlet>
    </step>
    
    <step id="mailStatement">
        <batchlet ref="mailStatementBatchlet"></batchlet>
        <end on="COMPLETED"/>
    </step>
</job>

A job descriptor, as shown above, is first identified by an id - in this example, mail-client-statements. This ID must be unique across all batch jobs in the application. The XML file can be called anything, but the convention is to name the file after the job; in this case, the job descriptor above is in a file called mail-client-statements.xml. For applications packaged as WAR files, job descriptor files are placed under WEB-INF/classes/META-INF/batch-jobs, and under META-INF/batch-jobs for applications packaged as JAR files.
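Once the descriptor is packaged, the job can be launched from application code through the standard JobOperator interface. The following is a minimal sketch; the statementDate runtime parameter and the StatementJobStarter class are hypothetical examples, not part of the job descriptor above.

```java
import jakarta.batch.operations.JobOperator;
import jakarta.batch.runtime.BatchRuntime;
import java.util.Properties;

public class StatementJobStarter {

    public long startJob() {
        // Obtain the batch runtime's JobOperator
        JobOperator jobOperator = BatchRuntime.getJobOperator();

        // Optional runtime parameters; statementDate is a hypothetical example
        Properties params = new Properties();
        params.setProperty("statementDate", "2023-12-31");

        // "mail-client-statements" matches the job id in the descriptor;
        // the returned execution id can be used to query the job's status later
        return jobOperator.start("mail-client-statements", params);
    }
}
```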

A job can have job-level, or global, properties that are available to all units specified in the job. In the above example, the job-level properties element defines three properties: propertyName, anotherProperty and defaultFileType. These will be available to all specified units of the job. A batch job is carried out primarily as a collection of one or more steps, each of which carries out an atomic unit of the larger job.
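As a sketch of how a batch artifact can read such a job-level property at runtime, the injected JobContext exposes the job's properties. The class below is a hypothetical batchlet, assuming the defaultFileType property from the descriptor above.

```java
import jakarta.batch.api.AbstractBatchlet;
import jakarta.batch.runtime.context.JobContext;
import jakarta.inject.Inject;
import jakarta.inject.Named;

@Named
public class MyBatchlet extends AbstractBatchlet {

    // JobContext gives batch artifacts access to job-level properties at runtime
    @Inject
    private JobContext jobContext;

    @Override
    public String process() throws Exception {
        // Reads the job-level property defined in the descriptor ("pdf")
        String fileType = jobContext.getProperties().getProperty("defaultFileType");

        // ... create the statement file in the configured format ...
        return "COMPLETED";
    }
}
```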

Step

A step is a logical unit of work in a job. A step can be any self-contained unit of work that should be carried out as part of the larger job. In our mail client statement job example, a given step can be one where client balances are computed at a specified date. Another step can be reading the customer statement for a given period and converting it to a PDF file. Yet another step can take the file created in the previous step and mail it to the customer. Another auxiliary step could read the same PDF file and upload it to some document storage service.

All these steps are self-contained units of work that can be described separately in a batch job. A step must have an ID, and optionally a next attribute that specifies which step comes after the execution of the current step. A step can also have properties, which are scoped just to units in that step, as well as step-specific listeners, which again are scoped to that particular step. The mail client statement job above has three steps - computeBalance, createFile and mailStatement.

The computeBalance step has its next attribute set to createFile. This means that when the execution of computeBalance completes, the batch runtime will proceed to execute the createFile step. The computeBalance step defines its own property called firstStepProperty with the value firstStepValue. This property will only be available to artefacts in this step. The step also defines a listener called firstStepListener. For the actual work of carrying out the unit of a job in a step, a step can have either a chunk or a task.

Chunk

A chunk is one of two atomic units of work that can be carried out within a step; the other is a task (covered later). A chunk specifies a unit of work that can be carried out on a given number of items. Remember, a batch job, at its most atomic level, processes batch items one at a time. A chunk specifies the two required operations, and one optional operation, that can be carried out on a single item in the batch processing: a reader, a writer and, optionally, a processor. The computeBalance step in the mail client statement batch description above has a chunk that specifies all three operations. A chunk can have a number of attributes that configure how the batch runtime manages each instance of the chunk. We will cover these attributes in detail in a future installment of this blog series. For now, let's look at the three operations that can be carried out in a chunk, starting with the reader.

Reader

A reader, in Jakarta Batch, is a Java class that implements the jakarta.batch.api.chunk.ItemReader interface. This interface has four methods that must be implemented. However, if the given batch job does not need all four methods, the reader can extend the jakarta.batch.api.chunk.AbstractItemReader class, which has empty placeholder implementations for three of them. Extending this class, a reader only has to implement the readItem() method from the ItemReader interface. The MyReader class for the mail client statement batch is shown below.

@Named
public class MyReader extends AbstractItemReader {

    @Override
    public Object readItem() throws Exception {
        // Return the next item to process, or null when there are no more items
        return null;
    }
}

The @Named CDI annotation makes this class available for reference in the expression language used in the batch job descriptor. By default, without specifying a name, @Named makes the annotated class available in the EL namespace as myReader, as used in the job description above. The readItem method reads a single unit of whatever data is being processed; in our mail client statement batch job, the reader can read the statement for each client, one at a time. The method's declared return type is java.lang.Object; in practice, you return whatever concrete Java type your application processes.

The reader can be implemented to read batch items from anywhere - a database, a file (JSON, XML, etc.), another module or microservice, or an external system. The batch runtime does not prescribe where the data comes from; it only specifies that an item should be returned. It will keep calling the readItem method as long as the method returns a non-null value. A null value signals to the batch runtime that there are no more items to read, and thus no need to call the reader again.
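As an illustrative sketch, a reader for our job might iterate over client accounts loaded when the step opens. ClientAccount and fetchAccounts() below are hypothetical stand-ins for your own domain type and data access code.

```java
import jakarta.batch.api.chunk.AbstractItemReader;
import jakarta.inject.Named;
import java.io.Serializable;
import java.util.Iterator;
import java.util.List;

@Named
public class ClientAccountReader extends AbstractItemReader {

    private Iterator<ClientAccount> accounts;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        // Load the accounts to process; fetchAccounts() is a hypothetical data access call
        accounts = fetchAccounts().iterator();
    }

    @Override
    public Object readItem() {
        // A null return value tells the batch runtime there is nothing left to read
        return accounts.hasNext() ? accounts.next() : null;
    }

    private List<ClientAccount> fetchAccounts() {
        // e.g. query the accounts table for clients due a statement
        return List.of();
    }
}
```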

Where does the returned item go, you ask? Depending on how you configured your chunk, it goes to the processor, or straight to the writer. In the mail client statement job, it goes to the processor.

Processor

A processor in Jakarta Batch is a Java class that implements the jakarta.batch.api.chunk.ItemProcessor interface. This interface has a single method - processItem(Object item) - that takes the item returned from the reader. Remember, the reader returns one batch item at a time. This item is automatically passed as a parameter to the processItem method, which can then do any kind of business-specific processing on it. In our mail client statements batch job, for example, the processor can compute the balance of a client based on the client account received. This balance can be based on a specific date range.

Jakarta Batch leaves the actual detail of how a given item is processed to your business domain. All it does is take the item from the reader and hands it over to your processor for processing. The MyProcessor class for the mail client statement batch job description is shown below.

@Named
public class MyProcessor implements ItemProcessor {

    @Override
    public Object processItem(Object item) throws Exception {
        // Transform or filter the item; returning null drops it from the chunk
        return null;
    }
}

Similar to the reader, the processor returns a unit of the processed item. As stated above, the processor is free to "process" the item received from the reader according to the business domain. The processor can also return null to signal to the batch runtime that the given item, as received from the reader, should not be part of the next steps of the batch processing. This allows the processor to filter out items according to business domain requirements.

For instance, the mail client statement batch processor can filter out client accounts that have not had any banking transaction in the last six months, returning null for each received client account that falls within that category. For every client that has had a transaction in that period, the processor can instead return a File object. Again, you might ask: where does the returned item from the processor go? It goes to the writer.
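Putting that filtering logic into code, a processor for our job might look like the following sketch. ClientAccount, hasRecentTransactions() and the statement generation helper are hypothetical domain members used for illustration.

```java
import jakarta.batch.api.chunk.ItemProcessor;
import jakarta.inject.Named;
import java.io.File;

@Named
public class StatementProcessor implements ItemProcessor {

    @Override
    public Object processItem(Object item) throws Exception {
        // The item is whatever concrete type the reader returned
        ClientAccount account = (ClientAccount) item;

        // Returning null filters the dormant account out of the chunk
        if (!account.hasRecentTransactions()) {
            return null;
        }

        return generateStatementFile(account);
    }

    private File generateStatementFile(ClientAccount account) {
        // Hypothetical PDF generation for active accounts (details omitted)
        return new File(account.id() + ".pdf");
    }
}
```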

Writer

A writer in Jakarta Batch is any Java class that implements the jakarta.batch.api.chunk.ItemWriter interface. This interface, like the ItemReader interface, has four methods, but for brevity a writer can extend the jakarta.batch.api.chunk.AbstractItemWriter class instead. This class only requires the implementation of the writeItems(List<Object> items) method. The MyWriter used in the mail client statement batch description is shown below.

@Named
public class MyWriter extends AbstractItemWriter {

    @Override
    public void writeItems(List<Object> items) throws Exception {
        // Write the accumulated chunk of processed items, e.g. persist or upload them
    }
}

The writeItems method takes a list of objects. These objects represent a collection of the concrete Java type passed from the reader to the processor and ultimately to the writer. But why does writeItems take a list? The batch runtime reads a single item, processes that single item, then collects the processed items into a list, up to a predefined count. When this count is hit, the list is passed to the writer. This list could be empty if the processor filters out all items, though this is not very common.
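That read-process-accumulate-write cycle can be illustrated with a plain-Java simulation (this is not the actual batch runtime, just a sketch of the chunk loop; the item count, reading from an iterator and upper-casing as "processing" are all stand-ins):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration of the chunk loop: read one item, process it,
// buffer the result, and hand the buffer to the writer once itemCount
// items have accumulated.
public class ChunkLoopSketch {

    public static List<List<String>> run(Iterator<String> reader, int itemCount) {
        List<List<String>> writes = new ArrayList<>(); // what the "writer" receives
        List<String> buffer = new ArrayList<>();

        while (reader.hasNext()) {
            String item = reader.next();           // stands in for readItem()
            String processed = item.toUpperCase(); // stands in for processItem()
            buffer.add(processed);

            if (buffer.size() == itemCount) {      // chunk boundary reached
                writes.add(new ArrayList<>(buffer)); // stands in for writeItems(buffer)
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            writes.add(buffer); // final, possibly short, chunk
        }
        return writes;
    }
}
```

With three items and an item count of two, the writer is invoked twice: once with the first two processed items and once with the final, shorter chunk.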

The writer can do whatever "writing" means within the given business context. In the mail client statement job, if the processor returned a File object for each processed client, the writer will receive a collection of Files for writing, where in this example writing could mean uploading said files to a given storage area.
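Continuing the sketch, a writer that uploads each statement file might look like the following; the StatementWriter class and uploadToStorage() helper are hypothetical.

```java
import jakarta.batch.api.chunk.AbstractItemWriter;
import jakarta.inject.Named;
import java.io.File;
import java.util.List;

@Named
public class StatementWriter extends AbstractItemWriter {

    @Override
    public void writeItems(List<Object> items) throws Exception {
        // Each item is the File the processor returned for an active client
        for (Object item : items) {
            uploadToStorage((File) item);
        }
    }

    private void uploadToStorage(File statement) {
        // Hypothetical call to a document storage service
    }
}
```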

Summary

In this first of a series of Jakarta Batch blog posts, we covered the basic structure and components of a batch job. We looked at defining a batch job, what a chunk is, and what a reader, processor and writer are and what they do. In the next installment of this blog series, we will take a detailed look at configuring the chunk. This will help us answer questions like: what determines the number of elements passed to the writer? What happens if an exception is thrown? Stay tuned for the next installment!

In the meantime, find out more about different Jakarta EE specifications:
