Transcript
MarkLogic 8: Infrastructure Management API, Flexible Replication, Incremental Backup, and Sizing Recommendations Caio Milani November 2014 © COPYRIGHT 2015 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
MarkLogic 8 Feature Presentations
Topics and Product Managers
Developer Experience: Samplestack and Reference Architecture (Kasey Alderete)
Developer Experience: Node.js and Java Client APIs, Server-side JavaScript, and Native JSON (Justin Makeig)
REST Management API, Flexible Replication, Sizing, and Reference Hardware Architectures (Caio Milani)
Bitemporal (Jim Clark)
Semantics (Stephen Buxton)
SLIDE: 2
Agenda
Flexible Replication
Management API
Incremental Backup
Reference Hardware Architecture
SLIDE: 3
FLEXIBLE REPLICATION
Flexible Replication
Customizable information sharing between systems
SLIDE: 5
Enable content collaboration across numerous systems
Support directly connected or mobile users
Provide data that users need using simple configurable parameters or queries
Ensure data consistency and security with simple workflows
Even better with Bitemporal and Management API
Intelligent Data Layer Enabling Data Collaboration
Data replicates across many databases
– No need for a master data store
– No need for continuous connectivity
– No need to replicate all data
Consistency on edits can be handled by
– Simple versioning
– Check-in/check-out/publish
– Conflict resolution rules
– Bitemporal collections
SLIDE: 6
Users Get Only the Data They Need
Data moves based on collections, URIs, or user-defined queries
User changes to settings and queries update replicated content on their laptops
Data can be transformed and filtered before replication
Security is consistent across all peers, ensuring reliable data access control
SLIDE: 7
Choosing the Right Feature for the Job
Flexible Replication is a document-centric solution aimed at information sharing
Flexible Replication is not intended for DR and does not preserve transaction boundaries
Database Replication makes a transactionally consistent copy of the primary data in another data center, aimed at DR
SLIDE: 8
How Documents Are Replicated
Flexible Replication is an asynchronous solution built on top of the Content Processing Framework (CPF), running on a task queue
Any time a target document changes, its properties fragment is updated. Document updates can be pushed (to the replica) or pulled (by the replica)
For push targets, an immediate push is attempted. For pull targets, the properties are updated to reflect that the document needs to be replicated
Query-based targets typically use pull and, for scalability reasons, query-based push targets will also not have an immediate push attempt
If the task server queue is more than half full, the Master Server will not push documents to the Replica and will instead leave them for the scheduled push task
SLIDE: 9
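The push/pull rules above can be read as a small decision function. The following is only an illustrative Python sketch of that logic, not MarkLogic code: the `Target` class, the action strings, and the queue parameters are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical description of a replication target."""
    mode: str          # "push" or "pull"
    query_based: bool  # True for query-based targets


def on_document_change(target: Target, queue_len: int, queue_capacity: int) -> str:
    """Return the action taken when a replicated document changes."""
    # Pull targets never push; the properties fragment records the pending change.
    if target.mode == "pull":
        return "mark-for-pull"
    # Query-based push targets skip the immediate attempt for scalability.
    if target.query_based:
        return "defer-to-scheduled-task"
    # No immediate push when the task server queue is more than half full.
    if queue_len * 2 > queue_capacity:
        return "defer-to-scheduled-task"
    return "push-now"
```

In every deferred case the scheduled replication task is what eventually moves the document.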
Scheduled Tasks
Regardless of whether you configure replication as push or pull, you must create a scheduled task to periodically replicate updated content
A scheduled replication task does the following:
– Moves zero-day content that existed before replication was configured
– Provides a retry mechanism in the event the initial replication fails
– Replicates deletes on the Master to the Replica
Replication retries are a combination of the task frequency, documents per batch, and minimum and maximum retry wait times
Zero-day documents replicate after documents that have failed replication
SLIDE: 10
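As a rough illustration of the ordering rule on this slide (failed documents are retried before zero-day content, within a per-run batch limit), here is a minimal Python sketch. The function name and the list-of-URIs representation are assumptions made for illustration; retry wait times are not modeled.

```python
from typing import List


def next_batch(failed: List[str], zero_day: List[str], docs_per_batch: int) -> List[str]:
    """Pick URIs for one scheduled run: failed documents first, then zero-day ones."""
    batch = failed[:docs_per_batch]
    remaining = docs_per_batch - len(batch)
    if remaining > 0:
        # Zero-day content only gets the slots left over after retries.
        batch += zero_day[:remaining]
    return batch
```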
Choose What To Replicate
Documents are replicated based on domain or serialized queries
A domain may be a document, a collection of documents, or a directory
A query works as if you were replicating the results of a search
Users can manage their queries to control what gets replicated
Users can also pause and restart replication to preserve bandwidth
SLIDE: 11
Query-Based Replication
Based on Alerting
Start with a FlexRep config
Create a query-based target by passing in a user id
Then use the alerting API to manage the user’s queries, and any matching documents will be replicated to the target
SLIDE: 12
cfg = flexrep:configuration-create()
flexrep:target-create()
admin:group-add-scheduled-task()
flexrep:configuration-target-set-user-id()
alert:make-rule(…. xdmp:user("me"), cts:word-query("apple")…)
flexrep:pull-create()
Modify Documents Before and After Replication
Flexible Replication supports filters that can modify the content, URI, properties, collections, permissions, or anything else about the document
Filters can help decide which documents to replicate, which not to, and which documents should have only pieces replicated
Filters can even wholly transform the content as part of the replication, using something like an XSLT stylesheet to automatically adjust from one schema to another
Filters work on master outbound data and/or replica inbound data
SLIDE: 13
Multi-Master
Each database can be a master for its own document sets and transmit updates to remote servers
A database can be a master for some content and a replica for other content
A database can transitively replicate to additional data centers
Figure labels: Updates; Reads; Domain/Query Application Replication
SLIDE: 14
Ownership and Conflicts In cases of conflict, the master by default “wins”
Implementation Example Logic of a virtual lock using custom code on outbound/ inbound filters
but filters and custom code can assist with more sophisticated conflict handling Filters can be used to modify document's
properties creating virtual locks (example) Or filters can move documents along collections:
“pending”, “merging”, “conflicted” to enable automatic or manual resolution This is a proven solution deployed in critical
operations
SLIDE: 15
Scale and Collaborate
Scalability to thousands of systems can be achieved by a tiered architecture
Core clusters replicate to regional clusters that replicate to personal databases
Modifications on personal databases can be cascaded back to core clusters and redistributed globally
Figure tiers: Core clusters; Regional clusters; Personal Databases
SLIDE: 16
MANAGEMENT API
Management API
REST-based API to manage all MarkLogic capabilities
SLIDE: 18
Increase efficiency and agility by automating time-consuming, repetitive tasks across production, testing, and development
Reduce setup time and admin error by orchestrating multi-step configurations and deployments
Fit more seamlessly into IT environments by using REST interfaces unlike CLI or proprietary APIs
Perform automated testing and monitor performance using market tools that support REST
Even better with Client REST API, Elasticity
Adaptive to Every Environment
Stateless HTTP calls adapt to changing datacenter topologies, unlike CLI and socket-based APIs
Use filtering and property parameters to scope endpoint calls and reduce client-side code
Format payloads and outputs as HTML, JSON, or XML, adapting to different scripting techniques
Control access to endpoints with the manage-user (GET, HEAD) and manage-admin roles
Manage simultaneous requests with built-in concurrency and lock control, avoiding partial or erroneous updates
SLIDE: 19
Script All Operations in MarkLogic 8
Topologies: Databases, forests, groups, application servers, cluster coupling and decoupling
Security: Users, roles, amps, privileges, and external security
HA/DR: Local failover, database and flexible replication
Backup and Storage: Backup and restore, Tiered storage, CPF configuration
Configuration: SQL views, re/index, merge, bitemporal, inference operations
Deployment: Host bootstrap manipulation, restart and shutdown operations, packaging
SLIDE: 20
From Read-Only to Full Control
MarkLogic 5: exposed read-only APIs for status and configuration information
MarkLogic 7: exposed cluster, host and forest-level interfaces sufficient for standing up a cluster
MarkLogic 8: exposes almost all other configuration/management tasks that can be accomplished via GUI, with minor exceptions
SLIDE: 21
General Pattern of Endpoints
Base path: /manage/(v2|latest)/
Columns: HTTP method; path under the base path; description; JSON or XML output/input
GET; resource-type; returns a list of the resources; Yes
POST; resource-type; accepts a “properties” flavor and creates a resource of that type; Yes
GET; resource-type/name; returns a description of the resource; Yes
DELETE; resource-type/name; deletes the resource; N/A
POST; resource-type/name; performs an operation on that resource; Yes
GET; resource-type/name/properties; returns a description of the resource in a “properties” flavor. Property representations are generally replayable; Yes
PUT; resource-type/name/properties; accepts a “properties” flavor and modifies the resource accordingly; Yes
SLIDE: 22
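The endpoint pattern in the table is regular enough to build paths programmatically. The Python sketch below illustrates the pattern under the assumption that the `v2` base path is used; the `endpoint` helper is a hypothetical name, not part of any MarkLogic client library.

```python
from typing import Optional
from urllib.parse import quote

BASE = "/manage/v2"  # the slide also allows /manage/latest


def endpoint(resource_type: str, name: Optional[str] = None, properties: bool = False) -> str:
    """Build a Management API path following the general endpoint pattern."""
    if properties and name is None:
        raise ValueError("a properties path needs a resource name")
    parts = [BASE, quote(resource_type, safe="")]
    if name is not None:
        parts.append(quote(name, safe=""))
    if properties:
        parts.append("properties")
    return "/".join(parts)
```

For example, `endpoint("databases", "test-db", properties=True)` yields the path a client would GET to read the properties of a database named `test-db`, or PUT to modify them.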
General Pattern of Parameters and Headers
Request parameters:
– format?
Acceptable formats:
– JSON
– XML
Request headers:
– accept?
– content-type
Response headers:
– content-type
Acceptable content types:
– application/xml
– application/json
– application/x-www-form-urlencoded
On endpoints that support both content negotiation via accept headers and a format parameter, the format parameter overrides the accept headers.
SLIDE: 23
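The precedence rule (format parameter over accept header) can be sketched as follows. This is an illustrative Python model of the rule only; the function name, the fallback to `application/xml` when neither is supplied, and the simplified Accept parsing (no q-values) are assumptions, not documented server behavior.

```python
from typing import Optional

MIME_BY_FORMAT = {"json": "application/json", "xml": "application/xml"}


def negotiated_type(format_param: Optional[str],
                    accept_header: Optional[str],
                    default: str = "application/xml") -> str:
    """Resolve the response content type; the format parameter wins over Accept."""
    if format_param is not None:
        key = format_param.lower()
        if key not in MIME_BY_FORMAT:
            raise ValueError(f"unsupported format: {format_param}")
        return MIME_BY_FORMAT[key]
    if accept_header:
        # Take the first listed media type that we know about.
        for item in accept_header.split(","):
            media = item.split(";")[0].strip().lower()
            if media in MIME_BY_FORMAT.values():
                return media
    return default  # assumed fallback, for illustration only
```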
Example: Payloads for POST
{
  "admin-username" : "adminuser",
  "admin-password" : "mypassword",
  "realm" : "public"
}
XML payload with the same values: adminuser, mypassword, public
SLIDE: 24
Example: Checking Backup Status
POST resource-type/name using an XQuery script and a JSON payload
$payload-status := '{"operation": "backup-status", "job-id": "' || $backup-jobid || '", "host-name": "' || $backuphostname || '"}'
$status-response := xdmp:http-post("http://localhost:8002/manage/v2/databases/test-db?format=json",
  <options xmlns="xdmp:http">
    <data>{$payload-status}</data>
    <headers>
      <content-type>application/json</content-type>
      <accept>application/json</accept>
    </headers>
  </options>)
SLIDE: 25
Example: Adding a Host to a Cluster
curl -X POST -d "" http://${JOINING_HOST}:8001/admin/v1/init
JOINER_CONFIG=`curl -s -S -X GET -H "Accept: application/xml" http://${JOINING_HOST}:8001/admin/v1/server-config`
curl -s -S --digest --user admin:password -X POST -o cluster-config.zip -d "group=Default" --data-urlencode "server-config=${JOINER_CONFIG}" -H "Content-type: application/x-www-form-urlencoded" http://${BOOTSTRAP_HOST}:8001/admin/v1/cluster-config
curl -s -S -X POST -H "Content-type: application/zip" --data-binary @./cluster-config.zip http://${JOINING_HOST}:8001/admin/v1/cluster-config
SLIDE: 26
Example: Adding Flexrep Configuration on a Master
POST resource-type/name/properties using an XQuery script and a JSON payload
$payload := '{"domain-name": "marklogic-com-domain-2", "alerting-uri": "http://marklogic.com/org/uri"}'
$response := xdmp:http-post("http://localhost:8002/manage/v2/databases/flxrepmaster-db/flexrep-configs?format=json", …
  <options xmlns="xdmp:http">
    <data>{$payload}</data>
    <headers>
      <content-type>application/json</content-type>
      <accept>application/json</accept>
    </headers>
  </options>)
SLIDE: 27
INCREMENTAL BACKUP
Incremental Backup
Faster backups while using less storage
Store only changes since the previous full or incremental backup
Consume less storage for backup copies
Reduce backup window
Improve availability with multiple daily backups
Work with Log Archiving to enable fine-grained point-in-time recovery
Figure: INCREMENTAL BACKUP (delta/differential) timeline, SUNDAY through the following SUNDAY, with FULL backups marked
SLIDE: 29
Uncompromised Data Resiliency
Reduce Recovery Point Objective (RPO) with incremental backup and journal archiving
Perform point-in-time recovery to overcome garbage-in problems
Simple operation, as the server restores the backup set and replays the journal starting from the given timestamp
Figure labels: Journal Frames; Archived Journals; Active Journal; Journal Frames With Timestamps; Full or Incremental Backup; Garbage in; Restore timestamp in journal; FULL BACKUP followed by four INCREMENTAL BACKUPs
SLIDE: 30
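Point-in-time recovery as described here amounts to restoring the most recent backup taken at or before the requested time and then replaying timestamped journal frames up to that time. The Python sketch below models that selection only; the function name and the use of plain numeric timestamps are assumptions for illustration and do not reflect MarkLogic's internal formats.

```python
from bisect import bisect_right
from typing import List, Tuple


def plan_point_in_time_restore(backup_times: List[int],
                               journal_frame_times: List[int],
                               target: int) -> Tuple[int, List[int]]:
    """Pick the backup to restore and the journal frames to replay up to target."""
    backups = sorted(backup_times)
    i = bisect_right(backups, target)
    if i == 0:
        raise ValueError("no backup at or before the requested time")
    base = backups[i - 1]
    # Replay only frames written after the chosen backup and up to the target time.
    frames = [t for t in sorted(journal_frame_times) if base < t <= target]
    return base, frames
```

More frequent incremental backups shorten the list of frames that must be replayed, which is one way to read the RPO claim on this slide.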
Smaller, Faster, and More Consistent
Store only data that changed since last full (cumulative) for faster restores
Store changes since last incremental (deltas) for faster backups and less space
Shorter validation as subsequent incrementals do not examine the full backup
Backup and restore are transactional and guarantee a consistent view of the data
Figure: two weekly timelines (SUNDAY through the following SUNDAY, FULL backups marked), one for INCREMENTAL BACKUP (delta/differential) and one for INCREMENTAL BACKUP (cumulative)
Figure labels: 1-Full Backup; 2-Incremental; 3-Subsequent; Begin Transaction; Validation Phase; Copy Phase; Sync Phase; End Transaction; TIME
SLIDE: 31
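The trade-off between the two incremental styles shows up in how many backup sets a restore needs. A minimal Python sketch, assuming a simple time-ordered list of `(label, kind)` tuples (a representation invented here for illustration):

```python
from typing import List, Tuple

Backup = Tuple[str, str]  # (label, kind) where kind is "full" or "incremental"


def restore_chain(backups: List[Backup], strategy: str) -> List[Backup]:
    """Return the backup sets needed to restore to the latest backup.

    backups must be in time order and contain at least one full backup.
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    incrementals = backups[last_full + 1:]
    if strategy == "delta":
        # Each delta holds changes since the previous incremental: all are needed.
        return [backups[last_full]] + incrementals
    if strategy == "cumulative":
        # Each cumulative holds all changes since the full: only the newest is needed.
        return [backups[last_full]] + incrementals[-1:]
    raise ValueError(f"unknown strategy: {strategy}")
```

With a Sunday full backup and incrementals on Monday through Wednesday, the delta strategy restores four sets while the cumulative strategy restores two, at the cost of larger incrementals.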
Distributed Backups and Restores
Database backup and restore operations are distributed
All data nodes in a cluster participate
Backup and restore provide consistent database-level backups and restores
SLIDE: 32
Backup Directory Structure
When you back up a database, you specify a backup directory
Incremental backups are stored in their own directory
Supports either a shared or unshared directory (same path must exist on each data node)
Example: the backup directory is /abc/backups and the incremental backup directory is /abc/incremental
/abc/backups
  20140801-1223942093224 (full backup on 8/1)
/abc/incremental
  20140801-1223942093224
    20140802-331006226070 (incremental backup on 8/2)
    20140803-341007528950 (incremental backup on 8/3)
SLIDE: 33
Flexibility to Select Data to Back Up
By default you back up everything:
– The configuration files
– The Security database, including all of its forests
– The Schemas database, including all of its forests
– All of the forests of the database you are backing up
If you back up all forests, you will have a backup that you can restore to the exact same state as when the backup begins copying files
You can also back up individual forests, choosing the ones you need. Forest-level backups are consistent for the data in the forest
SLIDE: 34
Consistent Database-Level Backups and Restores
Backup and restore operations are transactional and guarantee a consistent view of the data
Data changes after copy begins are not reflected in the backup or restore set
Backup and restore operations do not lock the database
Database and forest administrative tasks such as drop, clear, and delete cannot take place during a backup; any such operation is queued up and will initiate after the backup transaction has completed
SLIDE: 35
Phases of Backup and Restore Operation
Figure labels: Begin Transaction; Validation Phase; Copy Phase; Sync Phase; End Transaction
Validation Phase
– Checks for needed files and directories and whether they are writable and valid
– For backup operations, checks for sufficient disk space
Copy Phase
– The files are actually copied to or from the backup directory
– The config files are copied at the beginning and a timestamp is written
– Starts a transaction; if the transaction fails on a restore, the database remains unchanged
Synchronization Phase
– Deletes temporary files
– Leaves the database in a consistent state
– On a restore, it also takes the old version of the database offline and replaces it with the newly restored version
SLIDE: 36
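To make the validation phase concrete, here is a small Python sketch of the kinds of checks listed above (directory present, writable, enough free space). It is an analogy using the Python standard library, not the server's implementation; the function name and problem strings are hypothetical.

```python
import os
import shutil
from typing import List


def validate_backup_dir(path: str, required_bytes: int) -> List[str]:
    """Return a list of problems that would block a backup to this directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append("missing directory")
        return problems  # nothing else can be checked without the directory
    if not os.access(path, os.W_OK):
        problems.append("directory not writable")
    if shutil.disk_usage(path).free < required_bytes:
        problems.append("insufficient disk space")
    return problems
```

An empty list corresponds to a validation phase that passes, after which the copy phase would begin.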
Summary of Incremental Backup
Since an incremental backup takes less time than a full backup, it is possible to schedule frequent incremental backups (for example, by the hour)
A full backup and a series of incremental backups can allow you to recover from a situation where a database has been lost
Incremental backup can be used with or without journal archiving
If you enable both incremental backup and journal archiving, you can replay the journal starting from the last incremental backup timestamp
Incremental backups are recommended for large databases that would take a long time to back up in full mode
SLIDE: 37
Backup/Restore Operations with Journal Archiving
Journal Archiving enables restore to a specific point in time between backups with the input of a wall clock time
When journal archiving is enabled, journal frames are written to backup directories by near-synchronous streaming from the active journal
When journal archiving is enabled, you will experience longer restore times and slightly increased system load as a result of the streaming of journal frames
Performance can be tuned by adjusting the lag limit, the amount of time by which journal frames can differ from the frames streamed to the backup journal
After incremental backup, journals can be automatically purged to save space
SLIDE: 38
REFERENCE HARDWARE ARCHITECTURE
Reference Hardware Architecture
With some direct recommendations, you will know exactly how many nodes you will need for your data to ensure you achieve optimal performance for your applications at the lowest cost.
SLIDE: 40
Figure labels: PERFORMANCE; CAPACITY; 100% INDEXED; 1% INDEXED
Sizing Forests of Indexed Content
PERFORMANCE: 100 GB/Forest, 8M Docs/Forest
– High performance
– Many facets/range indexes (~10)
– Sub-second
– High number of concurrent requests
– Positions
CAPACITY: 500 GB/Forest, 100M Docs/Forest
– High capacity
– Fewer concurrent requests
– Archive/Repository/Analytics
SLIDE: 41
Indexed Content Versus Non-Indexed Content
100% INDEXED: Database Records, Small Text Files (100% indexed)
1% INDEXED: Media, Binaries (metadata only, 1% indexed)
SLIDE: 42
Ready to Wear: High Performance/High Capacity
Minimum number of hosts and forests per host remains constant
– 3 host cluster, 6 primary forests, 6 replica forests per host on commodity hardware
Size of forests shifts depending upon where you are on the High Performance/High Capacity spectrum
SLIDE: 43
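The sizing guidance in this section (100 GB per forest at the performance end, 500 GB per forest at the capacity end, 6 primary forests per host, at least 3 hosts) can be turned into a back-of-the-envelope calculation. The Python sketch below is only that; the function and constant names are made up for illustration, and real sizing would also account for replicas, temp space, binaries, and logs.

```python
import math
from typing import Tuple

FOREST_GB = {"performance": 100, "capacity": 500}  # per-forest sizes from the deck
PRIMARY_FORESTS_PER_HOST = 6
MIN_HOSTS = 3


def size_cluster(indexed_gb: float, profile: str) -> Tuple[int, int]:
    """Estimate (primary forests, hosts) for a given amount of indexed content."""
    forests = math.ceil(indexed_gb / FOREST_GB[profile])
    hosts = max(MIN_HOSTS, math.ceil(forests / PRIMARY_FORESTS_PER_HOST))
    return forests, hosts
```

Under these assumptions, the minimum 3-host cluster with 18 primary forests holds about 1.8 TB of indexed content at the performance end and about 9 TB at the capacity end.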
Ready to Wear: High Performance
Storage: 20 2.5" 15K 600 GB drives
– RAID 10, striping plus mirroring
Use Case: Search Application
– Multiple facets (range indexes)
– Large number of concurrent users
– Subsecond queries
– Will require smaller forests with fewer documents per forest
SLIDE: 44
Ready to Wear: High Capacity
Storage: 20 2.5" 10K 1200 GB drives
– RAID 50, striping plus parity
Use Case: Data Warehouse, Large Scale Analytics
– Smaller number of concurrent users
– Batch report processing that can run offline
– Forests can get much larger
SLIDE: 45
Hardware/Sizing Recommendations
2U 25 SFF Chassis
2 Socket 8 Core/2.8 GHz
10GB Network
128GB – 256GB RAM
2 2GB RAID Cards
22 10K 900GB Data Drives
SLIDE: 46
Hardware/Sizing Recommendations
2U 25 SFF Chassis
32 Threads @ 2 GHz
1GB/Sec IO to Network
4GB/8GB per Thread
1GB/Sec IO to Disks
300GB/Forest + Temp, Binaries, Logs
SLIDE: 47
Example 3 Node Clusters (All HA) Archival, eDiscovery
Metadata Search, Media Store
Mid-Density Database
High Performance
RAID50
RAID50
RAID10
RAID10
4TB Indexed
2TB Indexed
22TB Indexed
9TB Indexed
• 6TB Online • 16TB Nearline
• 20TB Binaries
SLIDE: 48
Best Practices: Ancillary Database Placement
Replicate Security, Triggers, Modules, Schemas, Meters
Critical to replicate Security and Modules; multiple copies are good
When upgrading, masters should all be on ONE HOST in the cluster
Meters needs multiple forests at scale
SLIDE: 49
Best Practices: Huge Pages
Transparent Huge Pages (THP) are enabled by default in RHEL 6. Disable THP and configure Huge Pages instead
Huge Pages should be set to 3/8 of physical memory
Swap should be equal to the size of physical memory minus huge pages
SLIDE: 50
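The two rules of thumb on this slide reduce to simple arithmetic. The Python sketch below works them through, assuming a 2 MB huge page size (a common x86-64 default, but an assumption here); the function name and returned keys are illustrative.

```python
from typing import Dict


def huge_page_plan(physical_mem_gb: float, page_size_mb: int = 2) -> Dict[str, float]:
    """Apply the 3/8 huge pages rule and the swap = RAM - huge pages rule."""
    huge_gb = physical_mem_gb * 3 / 8
    pages = int(huge_gb * 1024 // page_size_mb)  # number of huge pages to reserve
    swap_gb = physical_mem_gb - huge_gb
    return {"huge_pages_gb": huge_gb, "huge_page_count": pages, "swap_gb": swap_gb}
```

For a host with 256 GB of RAM this gives 96 GB of huge pages (49,152 pages of 2 MB) and 160 GB of swap.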
Best Practices: Local Disk Replicas
6 Replicas per Host
– Ingestion: still need background merges for replicas
– Essentially doubles the size of the forests: now we have a copy of all documents in a replica forest
– 2x the size forests, 2x the number of forests
– Another way of saying this: non-HA is ½ of HA
– But don’t do that…
SLIDE: 51
Design Patterns for High Availability
6 Primary, 6 Replica per Host
Distribute across hosts: we don’t want to be in a situation where we’re not sharing load evenly in a failover situation
Easiest to add 3 hosts at a time and use the same distribution pattern; you can add one or two, but you will need to use forest migration
SLIDE: 52
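Three Host Cluster: Starting Configuration
Host 1
– Primary forests: f1 f2 f3 f4 f5 f6
– Replica forests: r-7 r-8 r-9 r-16 r-17 r-18
– Ancillary databases: Security, Modules, Triggers, Schema, Meters
Host 2
– Primary forests: f7 f8 f9 f10 f11 f12
– Replica forests: r-13 r-14 r-15 r-4 r-5 r-6
– Ancillary replicas: r-sec1 r-mod1
Host 3
– Primary forests: f13 f14 f15 f16 f17 f18
– Replica forests: r-1 r-2 r-3 r-10 r-11 r-12
– Ancillary replicas: r-sec2 r-mod2
SLIDE: 53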
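The layout on this slide follows a regular pattern: each host's six primary forests are split in half, with one half replicated to the previous host and the other half to the next host, so a failed host's load is shared evenly by the two survivors. The Python sketch below reproduces that pattern as read from the slide; the function name and dictionary structure are illustrative, and ancillary database replicas are not modeled.

```python
from typing import Dict, List


def replica_layout(hosts: int = 3, primaries_per_host: int = 6) -> Dict[int, Dict[str, List[str]]]:
    """Assign primary forests to hosts and spread each host's replicas over its neighbors."""
    half = primaries_per_host // 2
    layout = {h: {"primary": [], "replica": []} for h in range(1, hosts + 1)}
    forest = 1
    for h in range(1, hosts + 1):
        prev_host = (h - 2) % hosts + 1  # neighbor that takes the first half
        next_host = h % hosts + 1        # neighbor that takes the second half
        for i in range(primaries_per_host):
            layout[h]["primary"].append(f"f{forest}")
            target = prev_host if i < half else next_host
            layout[target]["replica"].append(f"r-{forest}")
            forest += 1
    return layout
```

With the defaults, Host 1 ends up holding replicas r-7, r-8, r-9, r-16, r-17, and r-18, matching the starting configuration shown on the slide.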