
MarkLogic 8: Infrastructure Management API, Flexible Replication, Incremental Backup, and Sizing Recommendations
Caio Milani, November 2014
© COPYRIGHT 2015 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.

MarkLogic 8 Feature Presentations
- Developer Experience: Samplestack and Reference Architecture – Kasey Alderete
- Developer Experience: Node.js and Java Client APIs, Server-side JavaScript, and Native JSON – Justin Makeig
- REST Management API, Flexible Replication, Sizing, and Reference Hardware Architectures – Caio Milani
- Bitemporal – Jim Clark
- Semantics – Stephen Buxton

Agenda
- Flexible Replication
- Management API
- Incremental Backup
- Reference Hardware Architecture

FLEXIBLE REPLICATION

Flexible Replication: Customizable Information Sharing Between Systems
- Enable content collaboration across numerous systems
- Support directly connected or mobile users
- Provide the data that users need using simple configurable parameters or queries
- Ensure data consistency and security with simple workflows
- Even better with Bitemporal and the Management API

Intelligent Data Layer Enabling Data Collaboration
- Data replicates across many databases
  – No need for a master data store
  – No need for continuous connectivity
  – No need to replicate all data
- Consistency on edits can be handled by
  – Simple versioning
  – Check-in/check-out/publish
  – Conflict resolution rules
  – Bitemporal collections

Users Get Only the Data They Need
- Data moves based on collections, URIs, or user-defined queries
- User changes to settings and queries update the replicated content on their laptops
- Data can be transformed and filtered before replication
- Security is consistent across all peers, ensuring reliable data access control

Choosing the Right Feature for the Job
- Flexible Replication is a document-centric solution aimed at information sharing; it is not intended for DR and does not preserve transaction boundaries
- Database Replication makes a transactionally consistent copy of the primary data in another data center, aimed at DR

How Documents Are Replicated
- Flexible Replication is an asynchronous solution built on top of the Content Processing Framework (CPF) running on a task queue
- Any time a target document changes, its properties fragment is updated; document updates can be pushed (to the replica) or pulled (by the replica)
- For push targets, an immediate push is attempted; for pull targets, the properties are updated to reflect that the document needs to be replicated
- Query-based targets typically use pull, and for scalability reasons query-based push targets also do not attempt an immediate push
- If the task server queue is more than half full, the master will not push documents to the replica and will instead leave them for the scheduled push task
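Because replication state is tracked in each document's properties fragment, you can inspect it directly from Query Console when debugging why a document has or has not replicated. A minimal sketch, assuming a hypothetical document URI; the exact property elements written by Flexible Replication are not shown on these slides:

  xquery version "1.0-ml";
  (: Inspect the properties fragment that Flexible Replication updates
     whenever the target document changes. :)
  let $uri := "/example/doc.xml"  (: hypothetical document URI :)
  return xdmp:document-properties($uri)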
Scheduled Tasks
- Regardless of whether you configure replication as push or pull, you must create a scheduled task to periodically replicate updated content
- A scheduled replication task does the following:
  – Moves zero-day content that existed before replication was configured
  – Provides a retry mechanism in the event the initial replication fails
  – Replicates deletes on the master to the replica
- Replication retries are governed by a combination of the task frequency, the documents per batch, and the minimum and maximum retry wait times
- Zero-day documents replicate after documents that have failed replication

Choose What to Replicate
- Documents are replicated based on domains or serialized queries
- A domain may be a document, a collection of documents, or a directory
- A query works as if you were replicating the results of a search
- Users can manage their queries to control what gets replicated
- Replication can also be paused and restarted in order to preserve bandwidth

Query-Based Replication Based on Alerting
- Start with a FlexRep configuration
- Create a query-based target by passing in a user ID
- Then use the Alerting API to manage the user's queries; any matching documents will be replicated to the target (a sketch of the alerting step appears after the Ownership and Conflicts slide below)

  cfg = flexrep:configuration-create()
  flexrep:target-create()
  admin:group-add-scheduled-task()
  flexrep:configuration-target-set-user-id()
  alert:make-rule(... xdmp:user("me"), cts:word-query("apple") ...)
  flexrep:pull-create()

Modify Documents Before and After Replication
- Flexible Replication supports filters that can modify the content, URI, properties, collections, permissions, or anything else about the document
- Filters can help decide which documents to replicate and which not to, and which documents should have only pieces replicated
- Filters can even wholly transform the content as part of the replication, for example using an XSLT stylesheet to automatically adjust from one schema to another
- Filters work on master outbound data and/or replica inbound data

Multi-Master
- Each database can be a master for its own document sets and transmit updates to remote servers
- A database can be a master for some content and a replica for other content
- A database can transitively replicate to additional data centers

Ownership and Conflicts
- In cases of conflict, the master "wins" by default, but filters and custom code can assist with more sophisticated conflict handling
- Filters can be used to modify a document's properties, creating virtual locks (example: implementing a virtual lock with custom code in outbound/inbound filters)
- Filters can also move documents through collections such as "pending", "merging", and "conflicted" to enable automatic or manual resolution
- This is a proven solution deployed in critical operations
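To make the query-based target more concrete, the following is a minimal sketch of the alerting step only, assuming the FlexRep configuration and query-based target have already been created as outlined above. The alerting URI, rule name, and action name are hypothetical placeholders; in practice the alerting URI and action come from the FlexRep query-based target setup:

  xquery version "1.0-ml";
  import module namespace alert = "http://marklogic.com/xdmp/alert"
      at "/MarkLogic/alert.xqy";

  (: Hypothetical alerting URI associated with the query-based target :)
  let $alert-uri := "http://example.com/flexrep/query-target"
  let $rule :=
    alert:make-rule(
      "apple-docs",                              (: rule name (placeholder) :)
      "Replicate documents that mention apple",  (: description :)
      xdmp:user("me"),                           (: the target user's id, as on the slide :)
      cts:word-query("apple"),                   (: documents matching this query replicate :)
      "placeholder-action",                      (: action name; supplied by the FlexRep setup in practice :)
      <alert:options/>)
  return alert:rule-insert($alert-uri, $rule)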
Scale and Collaborate
- Scalability to thousands of systems can be achieved with a tiered architecture
- Core clusters replicate to regional clusters, which replicate to personal databases
- Modifications on personal databases can be cascaded back to core clusters and redistributed globally

MANAGEMENT API

Management API: REST-Based API to Manage All MarkLogic Capabilities
- Increase efficiency and agility by automating time-consuming repetitive tasks across production, testing, and development
- Reduce setup time and admin error by orchestrating multi-step configurations and deployments
- Fit more seamlessly into IT environments by using REST interfaces rather than CLIs or proprietary APIs
- Perform automated testing and monitor performance using market tools that support REST
- Even better with the Client REST API and Elasticity

Adaptive to Every Environment
- Stateless HTTP calls adapt to changing datacenter topologies, unlike CLI and socket-based APIs
- Use filtering and property parameters to scope endpoint HTTP calls and reduce client-side code
- Format payloads and outputs as HTML, JSON, or XML, adapting to different scripting techniques
- Control access to endpoints with the manage-user (GET, HEAD) and manage-admin roles
- Manage simultaneous requests with built-in concurrency and lock control, avoiding partial or erroneous updates

Script All Operations in MarkLogic 8
- Topologies: databases, forests, groups, application servers, cluster coupling and decoupling
- Security: users, roles, amps, privileges, and external security
- HA/DR: local failover, database and flexible replication
- Backup and Storage: backup and restore, tiered storage, CPF configuration
- Configuration: SQL views, re/index, merge, bitemporal, inference operations
- Deployment: host bootstrap manipulation, restart and shutdown operations, packaging

From Read-Only to Full Control
- MarkLogic 5: exposed read-only APIs for status and configuration information
- MarkLogic 7: exposed cluster-, host-, and forest-level interfaces sufficient for standing up a cluster
- MarkLogic 8: exposes almost all other configuration and management tasks that can be accomplished via the GUI, with minor exceptions

General Pattern of Endpoints (under /manage/(v2|latest)/; requests and responses are JSON or XML unless noted)
- GET resource-type – returns a list of the resources
- POST resource-type – accepts a "properties" flavor and creates a resource of that type
- GET resource-type/name – returns a description of the resource
- DELETE resource-type/name – deletes the resource (no payload)
- POST resource-type/name – performs an operation on that resource
- GET resource-type/name/properties – returns a description of the resource in a "properties" flavor; property representations are generally replayable
- PUT resource-type/name/properties – accepts a "properties" flavor and modifies the resource accordingly
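Following the endpoint pattern above, a simple first call is listing the databases in a cluster. A minimal sketch using xdmp:http-get from Query Console; the host, port 8002, credentials, and format parameter are assumptions for a default install with digest authentication:

  xquery version "1.0-ml";
  (: GET resource-type: list all databases as JSON via the Management API. :)
  xdmp:http-get(
    "http://localhost:8002/manage/v2/databases?format=json",
    <options xmlns="xdmp:http">
      <authentication method="digest">
        <username>admin</username>      <!-- assumed credentials -->
        <password>password</password>
      </authentication>
    </options>)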
General Pattern of Parameters and Headers
- Request parameters: format? (optional; acceptable values: JSON or XML)
- Request headers: accept? and content-type (acceptable content types: application/xml, application/json, application/x-www-form-urlencoded)
- Response headers: content-type
- On endpoints that support both content negotiation via the Accept header and a format parameter, the format parameter overrides the Accept header

Example: Payloads for POST
JSON payload:

  {
    "admin-username" : "adminuser",
    "admin-password" : "mypassword",
    "realm" : "public"
  }

The slide also shows the equivalent XML payload carrying the same admin-username, admin-password, and realm values.

Example: Checking Backup Status
POST resource-type/name using an XQuery script and a JSON payload (content-type and accept set to application/json):

  $payload-status := '{"operation": "backup-status", "job-id": "' || $backup-job-id ||
                     '", "host-name": "' || $backup-host-name || '"}'
  $status-response := xdmp:http-post(
      "http://localhost:8002/manage/v2/databases/test-db?format=json",
      ...,
      {$payload-status})

Example: Adding a Host to a Cluster

  curl -X POST -d "" http://${JOINING_HOST}:8001/admin/v1/init

  JOINER_CONFIG=`curl -s -S -X GET -H "Accept: application/xml" \
      http://${JOINING_HOST}:8001/admin/v1/server-config`

  curl -s -S --digest --user admin:password -X POST -o cluster-config.zip \
      -d "group=Default" --data-urlencode "server-config=${JOINER_CONFIG}" \
      -H "Content-type: application/x-www-form-urlencoded" \
      http://${BOOTSTRAP_HOST}:8001/admin/v1/cluster-config

  curl -s -S -X POST -H "Content-type: application/zip" \
      --data-binary @./cluster-config.zip \
      http://${JOINING_HOST}:8001/admin/v1/cluster-config

Example: Adding a FlexRep Configuration on a Master
POST resource-type/name/properties using an XQuery script and a JSON payload (content-type and accept set to application/json):

  $payload := '{"domain-name": "marklogic-com-domain-2", "alerting-uri": "http://marklogic.com/org/uri"}'
  $response := xdmp:http-post(
      "http://localhost:8002/manage/v2/databases/flxrep-master-db/flexrep-configs?format=json",
      ...,
      {$payload})

INCREMENTAL BACKUP

Incremental Backup: Faster Backups While Using Less Storage
- Store only the changes since the previous full or incremental backup
- Consume less storage for backup copies
- Reduce the backup window
- Improve availability with multiple daily backups
- Work with log archiving to enable fine-grained point-in-time recovery

Uncompromised Data Resiliency
- Reduce the Recovery Point Objective (RPO) with incremental backup and journal archiving
- Perform point-in-time recovery to overcome garbage-in problems
- Operation is simple: the server restores the backup set and replays the archived journal starting from the given timestamp
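MarkLogic distinguishes wall-clock times from internal timestamps; the built-in below converts between them, which can be handy when correlating a desired recovery point with the timestamps recorded in a backup set and its archived journal. A minimal sketch with an illustrative dateTime:

  xquery version "1.0-ml";
  (: Convert the wall-clock moment you want to recover to into the
     internal timestamp form used when replaying the archived journal. :)
  let $recover-to := xs:dateTime("2014-08-03T09:30:00-07:00")  (: illustrative time :)
  return xdmp:wallclock-to-timestamp($recover-to)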
Smaller, Faster, and More Consistent
- Store only the data that changed since the last full backup (cumulative incrementals) for faster restores
- Store only the changes since the last incremental (deltas/differentials) for faster backups and less space
- Validation is shorter because subsequent incrementals do not examine the full backup
- Backup and restore are transactional and guarantee a consistent view of the data

Distributed Backups and Restores
- Database backup and restore operations are distributed
- All data nodes in a cluster participate
- Backup and restore provide consistent database-level backups and restores

Backup Directory Structure
- When you back up a database, you specify a backup directory
- Incremental backups are stored in their own directory
- Supports either a shared or unshared directory (the same path must exist on each data node)
- Example: the backup directory is /abc/backup and the incremental backup directory is /abc/incremental

  /abc/backups
    20140801-1223942093224          (full backup on 8/1)
  /abc/incremental
    20140801-1223942093224
      20140802-331006226070         (incremental backup on 8/2)
      20140803-341007528950         (incremental backup on 8/3)

Flexibility to Select the Data to Back Up
- By default you back up everything:
  – The configuration files
  – The Security database, including all of its forests
  – The Schemas database, including all of its forests
  – All of the forests of the database you are backing up
- If you back up all forests, you have a backup that you can restore to the exact state the database was in when the backup began copying files
- You can also back up individual forests, choosing the ones you need; forest-level backups are consistent for the data in the forest

Consistent Database-Level Backups and Restores
- Backup and restore operations are transactional and guarantee a consistent view of the data
- Data changes made after the copy begins are not reflected in the backup or restore set
- Backup and restore operations do not lock the database
- Database and forest administrative tasks such as drop, clear, and delete cannot take place during a backup; any such operation is queued and initiates after the backup transaction has completed

Phases of Backup and Restore Operations
- Validation phase
  – Checks that the needed files and directories exist and are writable and valid
  – For backup operations, checks for sufficient disk space
- Copy phase
  – The files are actually copied to or from the backup directory
  – The config files are copied at the beginning and a timestamp is written
  – Starts a transaction; if the transaction fails on a restore, the database remains unchanged
- Synchronization phase
  – Deletes temporary files
  – Leaves the database in a consistent state
  – On a restore, also takes the old version of the database offline and replaces it with the newly restored version
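To tie the directory structure and phases together, the snippet below kicks off a full backup of every forest in a database from XQuery and returns the backup job id. It is a minimal sketch: the database name and backup path are assumptions, and scheduled or incremental backups would normally be configured through the Admin UI or the Management API rather than ad hoc like this.

  xquery version "1.0-ml";
  (: Start a full backup of all forests in a database; the returned job id
     can be used when checking backup status. :)
  let $db := xdmp:database("Documents")                  (: assumed database name :)
  let $forests := xdmp:database-forests($db)
  return xdmp:database-backup($forests, "/abc/backup")   (: assumed backup directory :)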
Summary of Incremental Backup
- Since an incremental backup takes less time than a full backup, it is possible to schedule frequent incremental backups (for example, hourly)
- A full backup plus a series of incremental backups lets you recover from a situation where a database has been lost
- Incremental backup can be used with or without journal archiving
- If you enable both incremental backup and journal archiving, you can replay the journal starting from the last incremental backup timestamp
- Incremental backups are recommended for large databases that would take a long time to back up in full

Backup/Restore Operations with Journal Archiving
- Journal archiving enables restoring to a specific point in time between backups, given a wall-clock time as input
- When journal archiving is enabled, journal frames are written to the backup directories by near-synchronous streaming from the active journal
- When journal archiving is enabled, you will experience longer restore times and slightly increased system load as a result of the streaming of journal frames
- Performance can be tuned by adjusting the lag limit: the amount of time by which the frames streamed to the backup journal may differ from the active journal
- After an incremental backup, journals can be automatically purged to save space

REFERENCE HARDWARE ARCHITECTURE

Reference Hardware Architecture
With some direct recommendations, you will know how many nodes you need for your data to achieve optimal performance for your applications at the lowest cost. The guidance spans a spectrum from performance to capacity, and from 100% indexed content to 1% indexed content.

Sizing Forests of Indexed Content (a rough worked example follows the Ready to Wear slides below)
- Performance-oriented: roughly 100 GB or 8 million documents per forest
  – High performance
  – Many facets/range indexes (~10)
  – Sub-second response for a high number of concurrent requests
  – Positions enabled
- Capacity-oriented: roughly 500 GB or 100 million documents per forest
  – High capacity
  – Fewer concurrent requests
  – Archive/repository/analytics workloads

Indexed Content Versus Non-Indexed Content
- Database records and small text files: 100% indexed
- Media binaries: metadata only, roughly 1% indexed

Ready to Wear: High Performance/High Capacity
- The minimum number of hosts and forests per host remains constant: a 3-host cluster with 6 primary forests and 6 replica forests per host on commodity hardware
- The size of the forests shifts depending on where you are on the high-performance/high-capacity spectrum

Ready to Wear: High Performance
- Storage: 20 x 2.5" 15K 600 GB drives, RAID 10 (striping plus mirroring)
- Use case: search application
  – Multiple facets (range indexes)
  – Large number of concurrent users
  – Sub-second queries
  – Requires smaller forests with fewer documents per forest

Ready to Wear: High Capacity
- Storage: 20 x 2.5" 10K 1200 GB drives, RAID 50 (striping plus parity)
- Use case: data warehouse, large-scale analytics
  – Smaller number of concurrent users
  – Batch report processing that can run offline
  – Forests can get much larger
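One way to turn the per-forest guidance above into a rough forest count is simply to divide the expected indexed content size by the target forest size. A minimal sketch with purely illustrative numbers, not a recommendation:

  xquery version "1.0-ml";
  (: Rough forest-count estimate from the sizing guidance:
     ~100 GB/forest performance-oriented, ~500 GB/forest capacity-oriented. :)
  let $indexed-content-gb := 4500       (: illustrative: expected indexed content :)
  let $gb-per-forest := 500             (: capacity-oriented target from the slides :)
  return fn:ceiling($indexed-content-gb div $gb-per-forest)   (: 9 forests :)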
Hardware/Sizing Recommendations (reference node)
- 2U 25 SFF chassis
- 2 sockets, 8 cores per socket at 2.8 GHz
- 128 GB to 256 GB RAM
- 10 Gb network
- 2 x 2 GB RAID cards
- 22 x 10K 900 GB data drives

Hardware/Sizing Recommendations (rules of thumb)
- 2U 25 SFF chassis
- 32 threads @ 2 GHz
- 4 GB to 8 GB RAM per thread
- 1 GB/sec I/O to the network
- 1 GB/sec I/O to the disks
- 300 GB per forest, plus space for temp, binaries, and logs

Example 3-Node Clusters (All HA)
- Archival, eDiscovery: RAID 50, 4 TB indexed
- Metadata Search, Media Store: RAID 50, 2 TB indexed
- Mid-Density Database: RAID 10, 22 TB indexed
- High Performance: RAID 10, 9 TB indexed
- Additional capacities noted on the slide: 6 TB online, 16 TB nearline, 20 TB binaries

Best Practices: Ancillary Database Placement
- Replicate Security, Triggers, Modules, Schemas, and Meters
- It is critical to replicate Security and Modules; multiple copies are good
- When upgrading, the masters should all be on ONE host in the cluster
- Meters needs multiple forests at scale

Best Practices: Huge Pages
- Transparent Huge Pages are enabled by default in RHEL 6; disable THP and configure Huge Pages instead
- Huge Pages should be set to 3/8 of physical memory
- Swap should equal the size of physical memory minus the Huge Pages allocation

Best Practices: Local Disk Replicas
- 6 replicas per host
  – Ingestion: background merges are still needed for the replicas
  – Essentially doubles the size of the forests: there is now a copy of every document in a replica forest
  – 2x the size of the forests, 2x the number of forests
  – Another way of saying this: non-HA is half of HA, but don't do that

Design Patterns for High Availability
- 6 primary and 6 replica forests per host
- Distribute forests across hosts; you don't want to end up in a situation where load is shared unevenly after a failover
- It is easiest to add 3 hosts at a time and reuse the same distribution pattern; you can add one or two hosts, but you will need to use forest migration

Three-Host Cluster: Starting Configuration
- Host 1: primary forests f1-f6; replica forests r-7, r-8, r-9, r-16, r-17, r-18; ancillary databases Security, Modules, Triggers, Schemas, Meters
- Host 2: primary forests f7-f12; replica forests r-13, r-14, r-15, r-4, r-5, r-6; ancillary replicas r-sec1, r-mod1
- Host 3: primary forests f13-f18; replica forests r-1, r-2, r-3, r-10, r-11, r-12; ancillary replicas r-sec2, r-mod2
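After laying out forests along the starting configuration above, it can help to verify from Query Console that primaries and replicas really are spread evenly across the hosts. A minimal sketch using standard built-ins; the output formatting is only illustrative:

  xquery version "1.0-ml";
  (: List every forest in the cluster alongside the host it lives on,
     to sanity-check the distribution of primary and replica forests. :)
  for $forest in xdmp:forests()
  order by xdmp:host-name(xdmp:forest-host($forest))
  return xdmp:host-name(xdmp:forest-host($forest)) || ": " || xdmp:forest-name($forest)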