Journal Overview, Sizing and Best Practice
Every write to a protected virtual machine is copied by Zerto Virtual Replication. The write continues to be processed normally on the protected site and the copy is sent asynchronously to the recovery site and written to a journal managed by a Virtual Replication Appliance (VRA). Each protected virtual machine has its own journal.
In addition to the writes, every few seconds all journals are updated with a checkpoint time-stamp. Checkpoints are used to ensure write order fidelity and crash-consistency. A recovery can be done to the last checkpoint or to a user-selected, crash-consistent checkpoint. This enables recovering the virtual machines, either to the last crash-consistent point-in-time or, for example, when the virtual machine is attacked by a virus, to a point-in-time before the virus attack.
Data and checkpoints are written to the journal until the specified journal history size is reached, which is the optimum situation. At this point, as new writes and checkpoints are written to the journal, the older writes are written to the recovery virtual machine’s virtual disks. When specifying a checkpoint to recover to, the checkpoint must still be present in the journal. For example, if the value specified is 24 hours, then recovery can be made to any checkpoint up to 24 hours in the past. After the time specified, the mirror virtual disk volumes maintained by the VRA are updated.
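The recoverability rule above can be sketched as a simple time-window check. This is a minimal illustration only: the function name and parameters are hypothetical, and it assumes the journal actually holds the full configured history (which may not be the case if the journal hard limit has been reached).

```python
from datetime import datetime, timedelta


def is_recoverable(checkpoint: datetime, now: datetime,
                   journal_history: timedelta) -> bool:
    """Return True if the checkpoint still falls inside the journal history window.

    Hypothetical helper: assumes the journal retains the full configured history.
    """
    return now - journal_history <= checkpoint <= now


now = datetime(2024, 1, 2, 12, 0)
history = timedelta(hours=24)

inside = is_recoverable(now - timedelta(hours=23), now, history)
outside = is_recoverable(now - timedelta(hours=25), now, history)
```

With a 24-hour history, a checkpoint from 23 hours ago is still selectable, while one from 25 hours ago has already been applied to the mirror disks.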
During recovery, the virtual machines at the recovery site are created and the recovery disks for each virtual machine, managed by the VRA, are attached to the recovered virtual machines. Information in the journal is promoted to the virtual machines to bring them up to the date and time of the selected checkpoint. To improve the RTO during recovery, the virtual machines can be used even before the journal data has been fully promoted. Every request is analyzed and the response is returned from the virtual machine directly or, if the information in the journal is more up-to-date, it comes from the journal. This continues until the recovery site’s virtual environment is fully restored to the selected checkpoint.
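The read behavior during promotion can be pictured with a small conceptual model. The sketch below is not Zerto's implementation; it assumes, purely for illustration, that the journal and the recovery disk are dictionaries keyed by block number, and all names are hypothetical.

```python
def read_block(block_id, journal, recovery_disk):
    """Serve a read from the journal if it holds newer data, else from the disk."""
    if block_id in journal:
        return journal[block_id]
    return recovery_disk[block_id]


def promote_one(journal, recovery_disk):
    """Move one pending journal entry onto the recovery disk."""
    block_id, data = journal.popitem()
    recovery_disk[block_id] = data


recovery_disk = {0: "disk-v1", 1: "disk-v1"}
journal = {1: "journal-v2"}  # newer write, not yet promoted

before = [read_block(b, journal, recovery_disk) for b in (0, 1)]
promote_one(journal, recovery_disk)
after = [read_block(b, journal, recovery_disk) for b in (0, 1)]
```

In this model the reads return the same data before and after promotion, which is why the virtual machine can be used while promotion is still in progress.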
Each protected virtual machine has its own dedicated journal, consisting of one or more volumes. A dedicated journal enables journal data to be maintained, even when changing the host for the recovery. The default datastore, or storage, used for a journal when protecting to a vCenter Server or an SCVMM is the storage used for recovery of each virtual machine. Thus, for example, if protected virtual machines in a VPG are configured with different recovery storage, the journal data for each virtual machine is by default stored on that virtual machine's recovery storage. When protecting to a vCloud Director, the default datastore used for a journal is the datastore with the most free space among those defined as journal datastores for the provider vDC in the Configure Provider vDCs dialog or, if no journal datastore was defined in that dialog, among all datastores visible to the recovery host.
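The vCloud Director default described above amounts to a two-step selection. The following sketch assumes a hypothetical `Datastore` record with a flag marking journal datastores; it only illustrates the selection rule and is not a Zerto API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Datastore:
    name: str
    free_gb: float
    is_journal_datastore: bool = False  # hypothetical flag: defined for the provider vDC


def default_vcd_journal_datastore(visible: Iterable[Datastore]) -> Optional[Datastore]:
    """Pick the datastore with the most free space, preferring journal datastores."""
    visible = list(visible)
    journal_candidates = [d for d in visible if d.is_journal_datastore]
    candidates = journal_candidates or visible
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.free_gb)


stores = [
    Datastore("ds-a", free_gb=900),
    Datastore("ds-b", free_gb=200, is_journal_datastore=True),
    Datastore("ds-c", free_gb=350, is_journal_datastore=True),
]
chosen = default_vcd_journal_datastore(stores)
fallback = default_vcd_journal_datastore([Datastore("ds-a", 900), Datastore("ds-d", 100)])
```

Here `ds-c` is chosen even though `ds-a` has more free space, because journal datastores were defined; without any journal datastore, the selection falls back to the visible datastore with the most free space.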
Defining the Journal
The journals are defined as part of the virtual protection group (VPG) definition. The definition includes VPG-level settings and VM-level settings.
VPG-Level Settings
The VPG-level journal settings are the default settings for all VMs within the VPG. They are set in the Create VPG wizard in the Replication tab.
Parameter | Description | Select/Enter a Value |
---|---|---|
Journal History | The time that all write commands are saved in the journal. The longer the information is saved in the journal, the more space is required for each journal in the VPG. | |
Default Journal Storage (Hyper-V) | The storage used for the journal data for each virtual machine in the VPG. Note: This field is not relevant when replicating to a vCD recovery site. | The storage accessible to the host. |
Default Journal Datastore (vSphere, Azure) (not relevant when replicating to vCD) | The datastore used for the journal data for each virtual machine in the VPG. Note: This field is not relevant when replicating to a vCD recovery site. | The datastore accessible to the host. |
Journal Size Hard Limit | The maximum size that the journal can grow, either as a percentage or a fixed amount. The journal is always thin-provisioned. Note: The Journal Size Hard Limit applies independently to both the Journal History and also to the Scratch Journal Volume. | |
Journal Size Warning Threshold | The size of the journal that triggers a warning that the journal is nearing its hard limit. | |
*The values of Size and Percentage must be less than the configured Journal Size Hard Limit so that the warning will be generated when needed. In addition to the warning threshold, Zerto will issue a message when the free space available for the journal is almost full.
Parameter | Description | Select/Enter a Value |
---|---|---|
Default Scratch Journal Storage (Hyper-V) | The storage used by the scratch journal to store all I/O performed by the test virtual machine during a recovery operation. | To change the default, specify a host and then select the storage location accessible by this host to be used as the default scratch journal storage. When you select specific scratch journal storage, the scratch journals for each virtual machine in the VPG are stored in this storage, regardless of where the recovery storage is for each virtual machine, or where the recovery datastore is for each journal. |
Default Scratch Journal Datastore | The datastore used by the scratch journal to store all I/O performed by the test virtual machine during a recovery operation. | To change the default, specify a host and then select one of the datastores accessible by this host to be used as the default scratch journal datastore. When you select a specific scratch journal datastore, the scratch journals for each virtual machine in the VPG are stored in this datastore, regardless of where the recovery datastore is for each virtual machine, or where the recovery datastore is for each journal. |
Scratch Journal Hard Limit | The maximum size that the scratch journal can grow, either as a percentage or a fixed amount. | Unlimited: The size of the scratch journal is unlimited and it can grow to the size of the recovery storage/datastore. If Unlimited is selected, Size and Percentage options are not displayed. |
 | | Size (GB): The maximum scratch journal size in GB. |
 | | Percentage: The percentage of the virtual machine volume size to which the scratch journal can grow. |
Scratch Journal Warning Threshold | The size of the scratch journal that triggers a warning that the scratch journal is nearing its hard limit. | Size* (GB): The size in GB that will generate a warning. |
 | | Percentage*: The percentage of the virtual machine volume size that will generate a warning. |
*The values of Size and Percentage must be less than the configured Scratch Journal Hard Limit so that the warning will be generated when needed. In addition to the warning threshold, Zerto will issue a message when the free space available for the scratch journal is almost full.
Tip: Zerto recommends that the journal storage should be accessible by all the recovery hosts and not just by one of the hosts.
Note: Changes to the VPG-level settings are not applied to virtual machines already defined in the VPG, only to virtual machines added to the VPG after the changes.
VM-Level Settings
You can modify VM-level settings for one or more VMs to override the VPG-level journal settings for each VM in the VPG. They are set in the Replication tab by selecting Advanced VM Settings.
Parameter | Description | Select/Enter a Value |
---|---|---|
Recovery Host (Hyper-V) (not relevant when replicating to vCD) | The cluster or host that will host the recovered virtual machine. | |
Recovery Host (vSphere) (not relevant when replicating to vCD) | The cluster, resource pool, or host that will host the recovered virtual machine. If the site is defined in Zerto Cloud Manager, only a resource pool can be specified and the resource pool must also have been defined in Zerto Cloud Manager. For details about Zerto Cloud Manager, see Zerto Cloud Manager Administration Guide. | When a resource pool is specified, Zerto checks that the resource pool capacity is enough for all the virtual machines specified in the VPG. If a resource pool is specified and DRS is disabled for the site later on, all the resource pools are removed by VMware and recovery is to any one of the hosts in the recovery site with a VRA installed on it. All resource pool checks are made at the level of the VPG and do not take into account multiple VPGs using the same resource pool. If the resource pool CPU resources are defined as unlimited, the actual limit is inherited from the parent but if this inherited value is too small, failover, move, and failover test operations can fail, even without a warning alert being issued by Zerto Virtual Manager. |
VM Recovery Datastore (vSphere) (not relevant when replicating to vCD) | The datastore where the VMware metadata files for the virtual machine are stored, such as the VMX file. | If a cluster or resource pool is selected for the host, only datastores that are accessible by every ESX/ESXi host in the cluster or resource pool are displayed. This is also the datastore where RDM backing files for recovery volumes are located. |
Recovery Storage (Hyper-V) (not relevant when replicating to vCD) | The location where the metadata files for the virtual machine are stored, such as the VHDX file. | If a cluster is selected for the host, only storage that is accessible by every host in the cluster is displayed. |
Parameter | Description | Select/Enter a Value |
---|---|---|
Journal Hard Limit | The maximum size that the journal can grow, either as a percentage or a fixed amount. | Unlimited: The size of the journal is unlimited and it can grow to the size of the recovery storage/datastore. If Unlimited is selected, Size and Percentage options are not displayed. |
 | | Size (GB): The maximum journal size in GB. |
 | | Percentage: The percentage of the virtual machine volume size to which the journal can grow. |
Journal Warning Threshold | The size of the journal that triggers a warning that the journal is nearing its hard limit. | Unlimited: The size of the journal is unlimited and it can grow to the size of the recovery storage/datastore. If Unlimited is selected, Size and Percentage options are not displayed. |
 | | Size* (GB): The size in GB that will generate a warning. |
 | | Percentage*: The percentage of the virtual machine volume size that will generate a warning. |
*The values of Size and Percentage must be less than the configured Journal Size Hard Limit so that the warning will be generated when needed. In addition to the warning threshold, Zerto will issue a message when the free space available for the journal is almost full.
Parameter | Description | Select/Enter a Value |
---|---|---|
Scratch Journal Storage (Hyper-V) | The storage used by the scratch journal to store all I/O performed by the test virtual machine during a recovery operation. | |
Scratch Journal Datastore | The datastore used by the scratch journal to store all I/O performed by the test virtual machine during a recovery operation. | To change the default, specify a host and then select one of the datastores accessible by this host to be used as the default scratch journal datastore. When you select a specific scratch journal datastore, the scratch journals for each virtual machine in the VPG are stored in this datastore, regardless of where the recovery datastore is for each virtual machine, or where the recovery datastore is for each journal. |
Scratch Journal Hard Limit | The maximum size that the scratch journal can grow, either as a percentage or a fixed amount. | Unlimited: The size of the scratch journal is unlimited and it can grow to the size of the recovery storage/datastore. If Unlimited is selected, Size and Percentage options are not displayed. |
 | | Size (GB): The maximum scratch journal size in GB. |
 | | Percentage: The percentage of the virtual machine volume size to which the scratch journal can grow. |
Scratch Journal Warning Threshold | The size of the scratch journal that triggers a warning that the scratch journal is nearing its hard limit. | Size* (GB): The scratch journal size in GB that will generate a warning. |
 | | Percentage*: The percentage of the virtual machine volume size that will generate a warning. |
*The values of Size and Percentage must be less than the configured Scratch Journal Hard Limit so that the warning will be generated when needed. In addition to the warning threshold, Zerto will issue a message when the free space available for the scratch journal is almost full.
Journal Behavior
After defining a VPG, the protected virtual machine disks are synced with the recovery site. After initial synchronization, each write to a protected virtual machine is copied, asynchronously sent to the recovery site, and then written to the journal of the recovery virtual machine. Every few seconds all journals are updated with a checkpoint time-stamp. The last checkpoint written to the journal is used to establish the recovery point objective (RPO).
Checkpoints are marks in the journal history that enable recovery to a specific time. All recent checkpoints are displayed in the Select Recovery Point dialog. Older checkpoints remain available, but the interval between the displayed checkpoints increases with their age. The actual history stored is measured from the present to the oldest checkpoint.
Data and checkpoints are written to the journal until the specified journal history size is reached, which is the optimum situation. At this point, as new writes and checkpoints are written to the journal, the older writes are written to the virtual machines’ recovery virtual disks.
Healthy Journal Protocol
Under normal operation, as new writes and checkpoints are written to the journal, older writes are applied to the recovery virtual machines’ recovery volumes. Checkpoints falling outside of the configured Journal History period are removed from the journal as new checkpoints are added.
Network outages and other events of a duration up to or exceeding the configured Journal History period can prevent new checkpoints from being written to the journal. Without intervention, the Journal History would contain gaps, or be depleted. In this situation, recovery operations would be degraded, or no longer be possible.
Healthy Journal Protocol automatically prevents this situation by ensuring that the journal will always retain a prescribed minimum number of hours of history, with each hour containing at least four checkpoints.
The minimum number of Journal History hours retained is calculated as:
MIN(Default_Hours_Retained , (Configured_Journal_History / 2))
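Where Default_Hours_Retained is 4 hours by default and Configured_Journal_History is the Journal History value, in hours, configured for the VPG.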
Examples:
Default_Hours_Retained | Configured_Journal_History | Min(Default_Hours_Retained, (Configured_Journal_History / 2)) | Minimum Number of Checkpoints |
---|---|---|---|
4 | 2 | 1 | 4 |
4 | 4 | 2 | 8 |
4 | 6 | 3 | 12 |
4 | 8 | 4 | 16 |
4 | 10 | 4 | 16 |
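The calculation behind the table above can be reproduced with a few lines of code. This is a minimal sketch: the function name is hypothetical, and the value of four checkpoints per hour is taken from the description of the Healthy Journal Protocol.

```python
def minimum_retained_history(configured_journal_history_hours,
                             default_hours_retained=4,
                             checkpoints_per_hour=4):
    """Return (minimum hours retained, minimum number of checkpoints)."""
    hours = min(default_hours_retained, configured_journal_history_hours / 2)
    return hours, int(hours * checkpoints_per_hour)


# Reproduce the rows of the examples table.
rows = [minimum_retained_history(h) for h in (2, 4, 6, 8, 10)]
```

For configured histories of 2, 4, 6, 8 and 10 hours, `rows` holds 1, 2, 3, 4 and 4 retained hours with 4, 8, 12, 16 and 16 checkpoints respectively, matching the table.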
Journal Size Changes
The journal is always thin-provisioned and the data passed from the protected site to the recovery site is saved in the journal in a compressed format. The provisioned journal size is the current size of all the journal volumes and the amount of space initially allocated is 16GB.
The provisioned sizes (in GB) of the first 10 journal volume extensions are: 16, 32, 33, 36, 52, 76, 110, 159, 231, 335
Note: The data is compressed when it is passed over the network to the recovery side and uncompressed when a recovery operation is required. This uses less space without impacting the RTO. Compression is enabled automatically when WAN compression is enabled for a VPG. When WAN compression is disabled for a VPG, the journal will also be uncompressed.
If the journal grows to approximately 80% of the provisioned journal size or less than 6GB remains free, a new volume is added to increase the journal size. Each new journal volume added is bigger than the previous volume. The journal size can increase up until a specified hard limit. If the hard limit of the journal is reduced in the VPG definition after new volumes have been added, these volumes are not removed and continue to be used if required.
In this case, the VRA will start to write data to the recovery disk to reduce the journal size until it meets the new (smaller) hard limit.
When the amount of the journal used is approximately 50% of the provisioned journal size, the biggest unused journal volume from the added volumes is marked for removal. This volume is then removed after a period equal to three times the configured journal history, or twenty-four hours, whichever is longer, provided that it is still unused.
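The growth and shrink rules above can be summarized as three small checks. This is an illustrative sketch only: the function names are hypothetical, and the 80% and 50% figures are described as approximate in the text, so the exact comparisons are an assumption.

```python
def should_add_volume(used_gb, provisioned_gb):
    """A new volume is added at roughly 80% usage or when less than 6GB is free."""
    free_gb = provisioned_gb - used_gb
    return used_gb >= 0.8 * provisioned_gb or free_gb < 6


def should_mark_volume_for_removal(used_gb, provisioned_gb):
    """The biggest unused added volume is marked at roughly 50% usage."""
    return used_gb <= 0.5 * provisioned_gb


def removal_delay_hours(journal_history_hours):
    """A marked volume is removed after 3x the journal history or 24 hours, whichever is longer."""
    return max(3 * journal_history_hours, 24)


# With the initial 16GB allocation, the 6GB free-space rule triggers first.
grow_at_11gb = should_add_volume(used_gb=11, provisioned_gb=16)
grow_at_8gb = should_add_volume(used_gb=8, provisioned_gb=16)
```

Note that for the initial 16GB volume, less than 6GB remains free once usage passes 10GB, so in this model the free-space rule applies before the 80% rule does.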
In a vSphere environment:
• With a VMFS datastore, and when the VRA is on an ESXi host of version 5.1 or higher, the journal can also reclaim unused space on a volume.
• Reclaiming space on a volume does not change the provisioned journal size, which is the current size of all the journal volumes. Also, unused space is not reclaimed when using NFS datastores, or any storage with a host version lower than 5.1.
Journal Behavior When Nearing the Journal Size Hard Limit
When the size of a virtual machine journal approaches its specified hard limit, Zerto Virtual Replication starts to move data to the target disks. Once this begins, the maintained history begins to decrease. If the journal history falls below 75% of the value specified for the journal history, a warning alert is issued in the GUI. If the history falls below one hour, an error is issued. If the amount of history defined is only one hour, an error is issued if it is less than 45 minutes.
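The alert thresholds described above can be expressed as a small classification function. This is a sketch under stated assumptions: the function name and return values are hypothetical, and history values are given in hours with a configured history of at least one hour.

```python
def journal_history_alert(actual_hours, configured_hours):
    """Classify the maintained journal history as 'ok', 'warning' or 'error'."""
    # With a one-hour history, the error threshold is 45 minutes instead of one hour.
    error_below = 0.75 if configured_hours == 1 else 1.0
    if actual_hours < error_below:
        return "error"
    if actual_hours < 0.75 * configured_hours:
        return "warning"
    return "ok"


levels = [
    journal_history_alert(24, 24),   # full history maintained
    journal_history_alert(17, 24),   # below 75% of 24 hours (18 hours)
    journal_history_alert(0.5, 24),  # below one hour
    journal_history_alert(0.7, 1),   # below 45 minutes of a one-hour history
]
```

The error check is evaluated first so that a history shorter than the error threshold is never reported as a mere warning.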
When the journal reaches the size limit, the limit takes precedence over the Healthy Journal Protocol, and checkpoints are written to the recovery disk.
If the journal is not big enough to store all the data for the time specified, as defined in the Journal Size Hard Limit parameter, the time frame for storing data is reduced. When the journal reaches the limit specified in the Journal Size Warning Threshold parameter, an alert is issued.
If a virtual machine needs a larger journal size than the hard limit specified to accommodate the journal history, you must manually adjust the size of the Journal Size Hard Limit parameter.
Journal Datastore Sizing Behavior
The datastore where the journal resides must have at least 30GB free, or 15% of the total datastore space free, whichever is smaller.
If the available space on the journal datastore falls below 30GB or 15% of the total datastore size, whichever is smaller:
• The datastore itself is considered full.
• An error alert is issued and all writes to the journal volumes on that datastore are blocked.
• Replication is halted, but history is not lost.
• The RPO begins to steadily increase until additional datastore space is made available.
Examples:
• For a large (2TB) datastore: 15% free space remaining = 307GB. The ZVM would not consider the datastore full if 307GB of free space were remaining. 30GB free space remaining would trigger an alert, as it is the smaller figure.
• For a small (100GB) datastore: 15% free space remaining = 15GB. The ZVM would not consider the datastore full if 30GB of free space were remaining. 15GB free space remaining would trigger an alert, as it is the smaller figure.
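The two examples follow directly from taking the smaller of the two figures. The sketch below uses hypothetical function names and treats 2TB as 2048GB, as the 307GB figure in the example implies.

```python
def datastore_full_threshold_gb(total_gb):
    """Free-space threshold: 30GB or 15% of the datastore, whichever is smaller."""
    return min(30.0, 0.15 * total_gb)


def is_datastore_full(free_gb, total_gb):
    """The datastore is considered full when free space falls below the threshold."""
    return free_gb < datastore_full_threshold_gb(total_gb)


large_threshold = datastore_full_threshold_gb(2048)  # 2TB datastore
small_threshold = datastore_full_threshold_gb(100)   # 100GB datastore
```

For the 2TB datastore the 30GB figure applies, and for the 100GB datastore the 15% figure (15GB) applies.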
When are Checkpoints in a Journal Not Usable?
• The following action causes the journal history to be reset:
    • Removing a volume from a virtual machine in a VPG. The journal is rebuilt starting with no history.
• Changing any of the following is performed transparently, using VMware Storage vMotion or Hyper-V Quick Storage Migration, and does not cause the journal to be reset:
    • The recovery storage of a volume.
    • The VPG journal storage.
    • The virtual machine recovery storage.
Determining Where to Store the Journal
Each protected virtual machine has its own dedicated journal, consisting of one or more volumes. A dedicated journal enables journal data to be maintained even when changing the target host used for recovery.
The journals for the protected virtual machines are defined as part of the VPG and, by default, are defined to reside on the same storage as the virtual machine. This can be overridden at the VPG level to allow storage tiering.
The following table shows the different journal storage options and the consequences of each.
Journal Storage Option | Allow Storage Tiering | Notes |
---|---|---|
Default Journal | No | The journal is located on the virtual machine’s recovery storage. By default, the journal storage for each virtual machine is the same as the virtual machine recovery storage. |
Journal Datastore / Storage for Each VPG | Yes | Specify a journal storage for each VPG. All journals for the virtual machines in the VPG are stored in this storage. |
Journal Datastore / Storage for Multiple VPGs | Yes | Enables the use of advanced settings, such as storage IO controls, to provide individualized service to customers by grouping VPGs by customer and assigning each group to a specific storage. This option is recommended for cloud providers. |
Journal Sizing
In general, assuming a 10% change rate per day and four hours of journal history saved, a 15GB limit is large enough to support a virtual machine with 1TB of storage. Use the Journal Sizing Tool to make better journal hard limit size estimates for each virtual machine.
Note: Assuming a constant change rate per day, a four-hour history requires half the space of an eight-hour history. Larger histories also require longer promotion times for moves or failover operations, impacting performance. If more space is required over time than is available, warnings and errors are issued. In this case you should increase the journal’s hard limit size.
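The proportional relationship in the note can be illustrated with a naive linear estimate. This is a rough sketch only: it is not the calculation used by the Journal Sizing Tool, it ignores compression and bursts in the change rate, and it does not attempt to reproduce the 15GB rule of thumb exactly. The function name is hypothetical.

```python
def estimate_journal_gb(volume_size_gb, daily_change_rate_pct, history_hours):
    """Naive linear estimate of the data written during the journal history window."""
    daily_change_gb = volume_size_gb * daily_change_rate_pct / 100
    return daily_change_gb * history_hours / 24


four_hours = estimate_journal_gb(1024, 10, 4)   # 1TB volume, 10% daily change
eight_hours = estimate_journal_gb(1024, 10, 8)
```

Under these assumptions a 1TB virtual machine changing 10% per day writes roughly 17GB in four hours, and the eight-hour figure is exactly double, which is the relationship the note describes.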
When defining a VPG, estimate the amount of storage required at the recovery site for the journal. Use the Journal Sizing Tool to estimate the size limit. The journal is thin-provisioned, therefore the actual size of the journal increases up to the maximum in order to accommodate the increasing size of the actual history.
Another way to estimate the required journal size is by trial. Create the VPG and configure the journal limit to 'unlimited'. After the VPG has collected the fully configured history, check the journal size consumed, and decide which hard limit to set based on this consumption.
Testing Considerations When Determining Journal Size
When a VPG is tested, either during a failover test or before committing a Move or Failover operation, a scratch volume is created for each virtual machine being tested. The scratch volume created uses the same size limit defined for the journal for each virtual machine.
The size limit of the scratch volume determines the length of time that you can test. Larger limits enable longer testing times if the rate of change is constant. If a small hard limit is set because only a short history is required, for example 2 to 3 hours, the scratch volume created for testing will also be small, thus limiting the time available for testing.
Note: The limit for the scratch volume cannot be increased during testing.
Thus, when considering the journal size limit you must also consider the length of time required to test the VPG. You must specify a limit for the journal accordingly, or specify that the size is unlimited, in which case it is only limited by the storage size.
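To get a feel for how the size limit bounds the testing time, a simple linear model can be used. The sketch below assumes a constant change rate during the test and ignores compression; the function name and parameters are hypothetical and are not part of any Zerto tool.

```python
def max_test_hours(scratch_limit_gb, volume_size_gb, test_daily_change_rate_pct):
    """Estimate how long a test can run before the scratch volume reaches its limit."""
    hourly_change_gb = volume_size_gb * test_daily_change_rate_pct / 100 / 24
    if hourly_change_gb == 0:
        return float("inf")  # no writes during the test: the limit is never reached
    return scratch_limit_gb / hourly_change_gb


hours_small_limit = max_test_hours(15, 1024, 10)
hours_large_limit = max_test_hours(60, 1024, 10)
```

In this model, a 15GB limit on a 1TB virtual machine changing 10% per day allows roughly three and a half hours of testing, and quadrupling the limit quadruples the available time.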
Estimating a Virtual Machine Journal Size
Use the Zerto Journal Sizing Tool to:
• Estimate the journal size hard limit that is required.
• Estimate the total storage size that is required for each storage used on the recovery site.
To estimate the journal size with the Journal Sizing Tool:
1. Open the Zerto Journal Sizing Tool.xlsx file.
2. Enter the data in the VMs sheet row by row for each protected virtual machine. If a virtual machine has more than one volume, use a separate row for each volume defined for the virtual machine. The Journal Sizing Tool’s data columns are:
    • VM: The name of the protected virtual machine.
    • VPG: The name of the VPG that includes the virtual machine.
    • Size (GB): The size of the virtual machine volumes.
    • Recovery Datastore/Storage: The name of the storage used to recover/replicate the source storage.
    • Journal Datastore/Storage: The name of the storage used for the journal.
    • Daily Change Rate (%): The expected change rate over a 24-hour period.
    • History (hours): The amount of history, in hours, required to be saved in the journal.
    • Required Journal (GB): The required journal size in GB, which is calculated automatically. This is the amount of space required for the journal. For space requirements for testing, continue with the next steps.
3. Fill in the testing considerations in the Testing Simulator sheet, row by row for each VPG, to determine additional space requirements for testing the VPG either during a failover test or before committing a move or failover operation. This sheet enables you to simulate a testing policy, testing all the protected virtual machines in parallel or one VPG at a time, or something between these two extremes.
4. Fill in the data in the Test Scratch Volume Definitions sheet with the testing change rate estimated over 24 hours and the time expected to be used for testing.
    • Test Change Rate (daily): The expected change rate during the test over a 24-hour period.
    • Test Duration (hours): The time allotted to carry out the testing.
5. View the results in the Datastore Usage sheet to assess the size requirements for each datastore/storage you use, both for recovery of protected data and for the additional space required when a journal resides on the same storage.
(vSphere Environments Only) Monitoring Journal VMDK Usage
The actual size of the thin journal VMDKs created for each VM can be obtained from the datastore browser by looking at the ZeRTO Volumes folder on the datastore. When using ESXi 5.1 and later, the used size of this VMDK displays the current journal usage.
Impact on a Journal of Resynchronizing a VPG
When a VPG is resynchronized, for example after the WAN or the recovery site host is down, older checkpoints in the journal are removed. During the resynchronization, the delta changes in the source/protection site are added to the journal and older data in the journal is moved to the target/recovery virtual disk managed by the VRA for the virtual machine.
As the resynchronization continues and more old data is moved out of the journal, the checkpoints associated with the data are also removed from the journal but new checkpoints are not added to it. If the time to complete synchronization is so long that all the existing checkpoints would be removed, the last few checkpoints are kept so that recovery operations can still be performed.
After a long synchronization has completed, the journal can contain a few checkpoints from before the synchronization started, followed by a gap without any checkpoints, and then a normal journal with checkpoints added every few seconds.
Note: Synchronizations do not usually last so long that all the checkpoints are removed from the journal.