Your cluster’s operation can hiccup for any of a myriad of reasons, from bugs in HBase itself, through misconfigurations (of HBase but also of the operating system), to hardware problems, whether a bug in your network card drivers or an underprovisioned RAM bus (to mention two recent examples of hardware issues that manifested as 'HBase is slow'). You will also need to recalibrate if, up to this point, your computing has been bound to a single box.
Supported In the context of Apache HBase, /supported/ means that HBase is designed to work in the way described, and deviation from the defined behavior or functionality should be reported as a bug.
Not Supported In the context of Apache HBase, /not supported/ means that a use case or use pattern is not expected to work and should be considered an antipattern. If you think this designation should be reconsidered for a given feature or use pattern, file a JIRA or start a discussion on one of the mailing lists.
Tested In the context of Apache HBase, /tested/ means that a feature is covered by unit or integration tests, and has been proven to work as expected.
Not Tested In the context of Apache HBase, /not tested/ means that a feature or use pattern may or may not work in a given way, and may or may not corrupt your data or cause operational issues. It is an unknown, and there are no guarantees. If you can provide proof that a feature designated as /not tested/ does work in a given way, please submit the tests and/or the metrics so that other users can gain certainty about such features or use patterns.
Getting Started.
$ tar xzvf hbase-3.0.0-SNAPSHOT-bin.tar.gz
$ cd hbase-3.0.0-SNAPSHOT/
• You are required to set the JAVA_HOME environment variable before starting HBase.
You can set the variable via your operating system’s usual mechanism, but HBase provides a central mechanism, conf/hbase-env.sh. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for your operating system. The JAVA_HOME variable should be set to a directory which contains the executable file bin/java. Most modern Linux operating systems provide a mechanism, such as /usr/bin/alternatives on RHEL or CentOS, for transparently switching between versions of executables such as Java. In this case, you can set JAVA_HOME to the directory containing the symbolic link to bin/java, which is usually /usr.
JAVA_HOME=/usr
• Edit conf/hbase-site.xml, which is the main HBase configuration file. At this time, you only need to specify the directory on the local filesystem where HBase and ZooKeeper write data. By default, a new directory is created under /tmp. Many servers are configured to delete the contents of /tmp upon reboot, so you should store the data elsewhere. The following configuration will store HBase’s data in the hbase directory, in the home directory of the user called testuser.
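A minimal sketch of what those properties might look like (the exact paths depend on your environment; the testuser home directory here is only an illustration):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>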
Paste the <property> tags beneath the <configuration> tags, which should be empty in a new HBase install. The hbase.rootdir in the above example points to a directory in the local filesystem. The 'file:/' prefix is how we denote the local filesystem. To home HBase on an existing instance of HDFS, set the hbase.rootdir to point at a directory up on your instance, e.g.:
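For instance, a hedged sketch (the namenode host and port below are placeholders, not values prescribed by this guide):
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.org:8020/hbase</value>
</property>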
For more on this variant, see the section below on Standalone HBase over HDFS. • The bin/start-hbase.sh script is provided as a convenient way to start HBase.
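For example, assuming you are in the unpacked HBase directory:
$ bin/start-hbase.sh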
Issue the command, and if all goes well, a message is logged to standard output showing that HBase started successfully. You can use the jps command to verify that you have one running process called HMaster. In standalone mode HBase runs all daemons within this single JVM, i.e. the HMaster, a single HRegionServer, and the ZooKeeper daemon. Go to http://localhost:16010 to view the HBase Web UI.
After working your way through standalone mode, you can re-configure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process: in standalone mode all daemons ran in one JVM process/instance. By default, unless you configure the hbase.rootdir property as described in the previous section, your data is still stored in /tmp/. In this walk-through, we store your data in HDFS instead, assuming you have HDFS available. You can skip the HDFS configuration to continue storing your data in the local filesystem.
The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster servers, which makes 10 total HMasters, counting the primary. To start a backup HMaster, use the local-master-backup.sh script.
For each backup master you want to start, add a parameter representing the port offset for that master. Each HMaster uses three ports (16010, 16020, and 16030 by default). The port offset is added to these ports, so using an offset of 2, the backup HMaster would use ports 16012, 16022, and 16032. The following command (see the sketch below) starts 3 backup servers using offsets 2, 3, and 5, i.e. ports 16012/16022/16032, 16013/16023/16033, and 16015/16025/16035.
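A sketch of that command, assuming the default bin/ layout of the installation:
$ bin/local-master-backup.sh start 2 3 5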
The HRegionServer manages the data in its StoreFiles as directed by the HMaster. Generally, one HRegionServer runs per node in the cluster. Running multiple HRegionServers on the same system can be useful for testing in pseudo-distributed mode. The local-regionservers.sh command allows you to run multiple RegionServers. It works in a similar way to the local-master-backup.sh command, in that each parameter you provide represents the port offset for an instance. Each RegionServer requires two ports, and the default ports are 16020 and 16030.
However, the base ports for additional RegionServers are not the default ports, since the default ports are used by the HMaster, which is also a RegionServer since HBase version 1.0.0. The base ports are 16200 and 16300 instead. You can run up to 99 additional RegionServers that are not an HMaster or backup HMaster on a single server. The following command (see the sketch below) starts four additional RegionServers, running on sequential ports starting at 16202/16302 (the base ports 16200 and 16300 plus an offset of 2).
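A sketch of the command described above, again assuming the default bin/ layout:
$ bin/local-regionservers.sh start 2 3 4 5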
HBase Configuration File Descriptions
backup-masters Not present by default. A plain-text file which lists hosts on which the Master should start a backup Master process, one host per line.
hadoop-metrics2-hbase.properties Used to connect HBase to Hadoop’s Metrics2 framework. See the Hadoop documentation for more information on Metrics2. Contains only commented-out examples by default.
hbase-env.cmd and hbase-env.sh Scripts for Windows and Linux / Unix environments, respectively, to set up the working environment for HBase, including the location of Java, Java options, and other environment variables. The files contain many commented-out examples to provide guidance.
hbase-policy.xml The default policy configuration file used by RPC servers to make authorization decisions on client requests. Only used if HBase security is enabled.
hbase-site.xml The main HBase configuration file. This file specifies configuration options which override HBase’s default configuration. You can view (but do not edit) the default configuration file at docs/hbase-default.xml. You can also view the entire effective configuration for your cluster (defaults and overrides) in the HBase Configuration tab of the HBase Web UI.
log4j.properties Configuration file for HBase logging via log4j.
regionservers A plain-text file containing a list of hosts which should run a RegionServer in your HBase cluster.
By default this file contains the single entry localhost. It should contain a list of hostnames or IP addresses, one per line, and should only contain localhost if each node in your cluster will run a RegionServer on its localhost interface.
Operating System Utilities
ssh HBase uses the Secure Shell (ssh) command and utilities extensively to communicate between cluster nodes. Each server in the cluster must be running ssh so that the Hadoop and HBase daemons can be managed. You must be able to connect to all nodes via SSH, including the local node, from the Master as well as any backup Master, using a shared key rather than a password. The basic methodology for such a set-up on Linux or Unix systems is described in any standard guide to passwordless SSH. If your cluster nodes use OS X, see the relevant section on the Hadoop wiki.
DNS HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolving must work in versions of HBase previous to 0.92.0. There is a tool that can be used to verify DNS is working correctly on the cluster; its project README file provides detailed instructions on usage.
Loopback IP Prior to hbase-0.96.0, HBase only used the IP address 127.0.0.1 to refer to localhost, and this was not configurable.
NTP The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism, on your cluster, and that all nodes look to the same service for time synchronization. See The Linux Documentation Project (TLDP) for instructions on setting up NTP.
Limits on Number of Files and Processes (ulimit) Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to 1024 (or 256 on older versions of OS X).
You can check this limit on your servers by running the command ulimit -n when logged in as the user which runs HBase. A limit that is too low can cause a variety of problems, and you may notice errors in the logs when it is hit. Configuring the maximum number of file descriptors and processes for the user who is running the HBase process is an operating system configuration, rather than an HBase configuration. It is also important to be sure that the settings are changed for the user that actually runs HBase.
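As an illustration only (the user name and the exact values are assumptions, not recommendations from this guide), the limits for the user running HBase might be raised on Linux via /etc/security/limits.conf:
hadoop  -  nofile  32768
hadoop  -  nproc   32000
The new limits take effect the next time that user logs in.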
To see which user started HBase, and that user’s ulimit configuration, look at the first line of the HBase log for that instance. A useful read on setting configuration for your Hadoop cluster is Aaron Kimball’s 'Configuration Parameters: What can you just ignore?'. Because HBase depends on Hadoop, it bundles an instance of the Hadoop jar under its lib directory.
The bundled jar is ONLY for use in standalone mode. In distributed mode, it is critical that the version of Hadoop that is out on your cluster match what is under HBase. Replace the hadoop jar found in the HBase lib directory with the hadoop jar you are running on your cluster to avoid version mismatch issues.
Make sure you replace the jar in HBase across your whole cluster. Hadoop version mismatch issues have various manifestations, but they often all look like the cluster has hung. Description Comma-separated list of servers in the ZooKeeper ensemble (this config should have been named hbase.zookeeper.ensemble). For example, 'host1.mydomain.com,host2.mydomain.com,host3.mydomain.com'. By default this is set to localhost for local and pseudo-distributed modes of operation. For a fully-distributed setup, this should be set to a full list of ZooKeeper ensemble servers.
If HBASE_MANAGES_ZK is set in hbase-env.sh, this is the list of servers which HBase will start/stop ZooKeeper on as part of cluster start/stop. Client-side, we will take this list of ensemble members, combine it with the hbase.zookeeper.property.clientPort config, and pass it into the ZooKeeper constructor as the connectString parameter.
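A hedged hbase-site.xml sketch of such an ensemble list (the property name matches the hbase.zookeeper.quorum example used later in this document; hostnames are placeholders):
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>host1.mydomain.com,host2.mydomain.com,host3.mydomain.com</value>
</property>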
Description Split the call queues into read and write queues. The specified interval (which should be between 0.0 and 1.0) will be multiplied by the number of call queues.
A value of 0 indicates not to split the call queues, meaning that both read and write requests will be pushed to the same set of queues. A value lower than 0.5 means that there will be fewer read queues than write queues. A value of 0.5 means there will be the same number of read and write queues. A value greater than 0.5 means that there will be more read queues than write queues. A value of 1.0 means that all the queues except one are used to dispatch read requests. Example: Given a total of 10 call queues, a read.ratio of 0 means that the 10 queues will contain both read and write requests. A read.ratio of 0.3 means that 3 queues will contain only read requests and 7 queues will contain only write requests. A read.ratio of 0.5 means that 5 queues will contain only read requests and 5 queues will contain only write requests. A read.ratio of 0.8 means that 8 queues will contain only read requests and 2 queues will contain only write requests. A read.ratio of 1 means that 9 queues will contain only read requests and 1 queue will contain only write requests. Description Given the number of read call queues, calculated from the total number of call queues multiplied by the callqueue.read.ratio, the scan.ratio property will split the read call queues into small-read and long-read queues. A value lower than 0.5 means that there will be fewer long-read queues than short-read queues. A value of 0.5 means that there will be the same number of short-read and long-read queues. A value greater than 0.5 means that there will be more long-read queues than short-read queues. A value of 0 or 1 indicates that the same set of queues is used for gets and scans. Example: Given a total of 8 read call queues, a scan.ratio of 0 or 1 means that the 8 queues will contain both long and short read requests.
A scan.ratio of 0.3 means that: 2 queues will contain only long-read requests and 6 queues will contain only short-read requests. A scan.ratio of 0.5 means that: 4 queues will contain only long-read requests and 4 queues will contain only short-read requests.
A scan.ratio of 0.8 means that: 6 queues will contain only long-read requests and 2 queues will contain only short-read requests. Description ZooKeeper session timeout in milliseconds.
It is used in two different ways. First, this value is used in the ZK client that HBase uses to connect to the ensemble. It is also used by HBase when it starts a ZK server and it is passed as the 'maxSessionTimeout'.
For example, if an HBase region server connects to a ZK ensemble that’s also managed by HBase, then the session timeout will be the one specified by this configuration. But a region server that connects to an ensemble managed with a different configuration will be subject to that ensemble’s maxSessionTimeout. So, even though HBase might propose using 90 seconds, the ensemble can have a max timeout lower than this and it will take precedence.
The current default that ZK ships with is 40 seconds, which is lower than HBase’s. Description Number of rows that we try to fetch when calling next on a scanner if it is not served from (local, client) memory. This configuration works together with hbase.client.scanner.max.result.size to try and use the network efficiently.
The default value is Integer.MAX_VALUE, so that the network will fill the chunk size defined by hbase.client.scanner.max.result.size rather than be limited by a particular number of rows, since the size of rows varies from table to table. If you know ahead of time that you will not require more than a certain number of rows from a scan, this configuration should be set to that row limit via Scan#setCaching. Higher caching values will enable faster scanners but will eat up more memory, and some calls of next may take longer and longer times when the cache is empty. Do not set this value such that the time between invocations is greater than the scanner timeout, i.e. hbase.client.scanner.timeout.period. Description If FlushLargeStoresPolicy is used and there are multiple column families, then every time that we hit the total memstore limit, we find out all the column families whose memstores exceed a 'lower bound' and only flush them while retaining the others in memory. The 'lower bound' will be 'hbase.hregion.memstore.flush.size / column_family_number' by default, unless the value of this property is larger than that. If none of the families have a memstore size above the lower bound, all the memstores will be flushed (just as usual).
Description If the memstores in a region are this size or larger when we go to close, run a 'pre-flush' to clear out memstores before we put up the region closed flag and take the region offline. On close, a flush is run under the close flag to empty memory. During this time the region is offline and we are not taking on any writes. If the memstore content is large, this flush could take a long time to complete. The preflush is meant to clean out the bulk of the memstore before putting up the close flag and taking the region offline so the flush that runs under the close flag has little to do.
Description Time between major compactions, expressed in milliseconds. Set to 0 to disable time-based automatic major compactions. User-requested and size-based major compactions will still run. This value is multiplied by hbase.hregion.majorcompaction.jitter to cause compaction to start at a somewhat-random time during a given window of time. The default value is 7 days, expressed in milliseconds. If major compactions are causing disruption in your environment, you can configure them to run at off-peak times for your deployment, or disable time-based major compactions by setting this parameter to 0, and run major compactions in a cron job or by another external mechanism.
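For example, a sketch of disabling time-based major compactions in hbase-site.xml (assuming the base property name is hbase.hregion.majorcompaction, consistent with the jitter property mentioned above):
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>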
Description The minimum number of StoreFiles which must be eligible for compaction before compaction can run. The goal of tuning hbase.hstore.compaction.min is to avoid ending up with too many tiny StoreFiles to compact. Setting this value to 2 would cause a minor compaction each time you have two StoreFiles in a Store, and this is probably not appropriate.
If you set this value too high, all the other values will need to be adjusted accordingly. For most cases, the default value is appropriate. In previous versions of HBase, the parameter hbase.hstore.compaction.min was named hbase.hstore.compactionThreshold. Description A StoreFile (or a selection of StoreFiles, when using ExploringCompactionPolicy) smaller than this size will always be eligible for minor compaction. HFiles this size or larger are evaluated by hbase.hstore.compaction.ratio to determine if they are eligible. Because this limit represents the 'automatic include' limit for all StoreFiles smaller than this value, this value may need to be reduced in write-heavy environments where many StoreFiles in the 1-2 MB range are being flushed, because every StoreFile will be targeted for compaction and the resulting StoreFiles may still be under the minimum size and require further compaction. If this parameter is lowered, the ratio check is triggered more quickly.
This addressed some issues seen in earlier versions of HBase but changing this parameter is no longer necessary in most situations. Default: 128 MB expressed in bytes. Description For minor compaction, this ratio is used to determine whether a given StoreFile which is larger than hbase.hstore.compaction.min.size is eligible for compaction. Its effect is to limit compaction of large StoreFiles.
The value of hbase.hstore.compaction.ratio is expressed as a floating-point decimal. A large ratio, such as 10, will produce a single giant StoreFile. Conversely, a low value, such as .25, will produce behavior similar to the BigTable compaction algorithm, producing four StoreFiles. A moderate value of between 1.0 and 1.4 is recommended.
When tuning this value, you are balancing write costs with read costs. Raising the value (to something like 1.4) will have more write costs, because you will compact larger StoreFiles.
However, during reads, HBase will need to seek through fewer StoreFiles to accomplish the read. Consider this approach if you cannot take advantage of Bloom filters. Otherwise, you can lower this value to something like 1.0 to reduce the background cost of writes, and use Bloom filters to control the number of StoreFiles touched during reads. For most cases, the default value is appropriate.
Description There are two different thread pools for compactions, one for large compactions and the other for small compactions. This helps to keep compaction of lean tables (such as hbase:meta) fast. If a compaction is larger than this threshold, it goes into the large compaction pool. In most cases, the default value is appropriate.
Default: 2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size (which defaults to 128MB). The value field assumes that the value of hbase.hregion.memstore.flush.size is unchanged from the default. Description The number of cells scanned in between heartbeat checks. Heartbeat checks occur during the processing of scans to determine whether or not the server should stop scanning in order to send back a heartbeat message to the client. Heartbeat messages are used to keep the client-server connection alive during long running scans.
Small values mean that the heartbeat checks will occur more often and thus will provide a tighter bound on the execution time of the scan. Larger values mean that the heartbeat checks occur less frequently. Description When a server is configured to require secure connections, it will reject connection attempts from clients using SASL SIMPLE (unsecure) authentication. This setting allows secure servers to accept SASL SIMPLE connections from clients when the client requests it.
When false (the default), the server will not allow the fallback to SIMPLE authentication, and will reject the connection. WARNING: This setting should ONLY be used as a temporary measure while converting clients over to secure authentication. It MUST BE DISABLED for secure operation. Description If set to true (the default), HBase verifies the checksums for hfile blocks. HBase writes checksums inline with the data when it writes out hfiles. HDFS (as of this writing) writes checksums to a separate file from the data file, necessitating extra seeks. Setting this flag saves some I/O. Checksum verification by HDFS will be internally disabled on hfile streams when this flag is set. If the hbase-checksum verification fails, we will switch back to using HDFS checksums (so do not disable HDFS checksums! Besides, this feature applies to hfiles only, not to WALs). If this parameter is set to false, then HBase will not verify any checksums; instead it will depend on checksum verification being done in the HDFS client. Description A comma-separated list of regular expressions used to match against an HTTP request’s User-Agent header when protection against cross-site request forgery (CSRF) is enabled for the REST server by setting hbase.rest.csrf.enabled to true. If the incoming User-Agent matches any of these regular expressions, then the request is considered to be sent by a browser, and therefore CSRF prevention is enforced.
If the request’s User-Agent does not match any of these regular expressions, then the request is considered to be sent by something other than a browser, such as scripted automation. In this case, CSRF is not a potential attack vector, so the prevention is not enforced. This helps achieve backwards-compatibility with existing automation that has not been updated to send the CSRF prevention header. Description If this setting is enabled and ACL based access control is active (the AccessController coprocessor is installed either as a system coprocessor or on a table as a table coprocessor) then you must grant all relevant users EXEC privilege if they require the ability to execute coprocessor endpoint calls. EXEC privilege, like any other permission, can be granted globally to a user, or to a user on a per table or per namespace basis. For more information on coprocessor endpoints, see the coprocessor section of the HBase online manual. For more information on granting or revoking permissions using the AccessController, see the security section of the HBase online manual.
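For illustration, a hedged HBase shell sketch of granting EXEC ('X') permission on a hypothetical table to a hypothetical user:
hbase> grant 'user1', 'X', 'my_table'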
Description The period (in milliseconds) for refreshing the store files for the secondary regions. 0 means this feature is disabled. Secondary regions see new files (from flushes and compactions) from the primary once the secondary region refreshes the list of files in the region (there is no notification mechanism). But too-frequent refreshes might cause extra NameNode pressure. If the files cannot be refreshed for longer than the HFile TTL (hbase.master.hfilecleaner.ttl), the requests are rejected.
Configuring HFile TTL to a larger value is also recommended with this setting. Description Whether asynchronous WAL replication to the secondary region replicas is enabled or not. If this is enabled, a replication peer named 'region_replica_replication' will be created which will tail the logs and replicate the mutations to region replicas for tables that have region replication >1. Once this has been enabled, disabling this replication also requires disabling the replication peer using the shell or the ReplicationAdmin Java class (see the sketch below).
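For example, a sketch of disabling that peer from the HBase shell (the peer name comes from the description above):
hbase> disable_peer 'region_replica_replication'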
Replication to secondary region replicas works over standard inter-cluster replication. Description By default, replication cannot guarantee that the order of operations on the slave cluster is the same as the order on the master. If REPLICATION_SCOPE is set to 2, edits will be pushed in the order they were written. This property sets how long (in ms) to wait before checking again when a log cannot be pushed right now because some logs written before it have not yet been pushed.
A larger wait will decrease the number of queries on hbase:meta but will increase the delay of replication. This feature relies on zk-less assignment, so users must set hbase.assignment.usezk to false to support it.
Example hbase-site.xml properties for a fully-distributed setup:
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>example1,example2,example3</value>
  <description>The directory shared by RegionServers.</description>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/export/zookeeper</value>
  <description>Property from ZooKeeper config zoo.cfg. The directory where the snapshot is stored.</description>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://example0:8020/hbase</value>
  <description>The directory shared by RegionServers.</description>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>The mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed ZooKeeper; true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh).</description>
</property>
The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and starts recovery. You might need to tune the timeout down to a minute or even less so the Master notices failures sooner.
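A hedged sketch of lowering that timeout in hbase-site.xml (assuming the standard zookeeper.session.timeout property; 60000 ms corresponds to the one-minute example in the text):
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>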
Before changing this value, be sure you have your JVM garbage collection configuration under control, otherwise, a long garbage collection that lasts beyond the ZooKeeper session timeout will take out your RegionServer. (You might be fine with this — you probably want recovery to start on the server if a RegionServer has been in GC for a long period of time). The reason why it is dangerous to keep this setting high is that the aggregate size of all the puts that are currently happening in a region server may impose too much pressure on its memory, or even trigger an OutOfMemoryError.
A RegionServer running on low memory will trigger its JVM’s garbage collector to run more frequently up to a point where GC pauses become noticeable (the reason being that all the memory used to keep all the requests' payloads cannot be trashed, no matter how hard the garbage collector tries). After some time, the overall cluster throughput is affected since every request that hits that RegionServer will take longer, which exacerbates the problem even more.
HBase also has a limit on the number of WAL files, designed to ensure there’s never too much data that needs to be replayed during recovery. This limit needs to be set according to memstore configuration, so that all the necessary data would fit. It is recommended to allocate enough WAL files to store at least that much data (when all memstores are close to full). For example, with a 16 GB RegionServer heap, default memstore settings (0.4), and default WAL file size (~60 MB), the starting point for the WAL file count is 16384 MB * 0.4 / 60 MB ≈ 109. However, as all memstores are not expected to be full all the time, fewer WAL files can be allocated. Instead of allowing HBase to split your regions automatically, you can choose to manage the splitting yourself.
This feature was added in HBase 0.90.0. Manually managing splits works if you know your keyspace well, otherwise let HBase figure where to split for you. Manual splitting can mitigate region creation and movement under load. It also makes it so region boundaries are known and invariant (if you disable region splitting).
If you use manual splits, it is easier to do staggered, time-based major compactions to spread out your network IO load. Determine the Optimal Number of Pre-Split Regions The optimal number of pre-split regions depends on your application and environment. A good rule of thumb is to start with 10 pre-split regions per server and watch as data grows over time. It is better to err on the side of too few regions and perform rolling splits later. The optimal number of regions depends upon the largest StoreFile in your region.
The size of the largest StoreFile will increase with time if the amount of data grows. The goal is for the largest region to be just large enough that the compaction selection algorithm only compacts it during a timed major compaction. Otherwise, the cluster can be prone to compaction storms with a large number of regions under compaction at the same time. It is important to understand that the data growth causes compaction storms and not the manual split decision. Do not turn off block cache (You’d do it by setting hfile.block.cache.size to zero).
Currently we do not do well if you do this because the RegionServer will spend all its time loading HFile indices over and over again. If your working set is such that block cache does you no good, at least size the block cache such that HFile indices will stay up in the cache (you can get a rough idea on the size you need by surveying RegionServer UIs; you’ll see index block size accounted near the top of the webpage). The issue is messy but has a bunch of good discussion toward the end on low timeouts and how to cause faster recovery including citation of fixes added to HDFS. Read the Varun Sharma comments.
The below suggested configurations are Varun’s suggestions distilled and tested. Make sure you are running on a late-version HDFS so you have the fixes he refers to and himself added to HDFS that help HBase MTTR (e.g. HDFS-3703, HDFS-3712, and HDFS-4791 — Hadoop 2 for sure has them and late Hadoop 1 has some). Set the following in the RegionServer:
dfs.client.socket-timeout = 10000 (down the DFS timeout from 60 to 10 seconds)
dfs.datanode.socket.write.timeout = 10000 (down the DFS timeout from 8 * 60 to 10 seconds)
ipc.client.connect.timeout = 3000 (down from 60 seconds to 3)
ipc.client.connect.max.retries.on.timeouts = 2 (down from 45 seconds to 3; 2 == 3 retries)
dfs.namenode.avoid.read.stale.datanode = true (enable stale state in HDFS)
dfs.namenode.stale.datanode.interval = 20000 (down from the default 30 seconds)
dfs.namenode.avoid.write.stale.datanode = true (enable stale state in HDFS)
JMX (Java Management Extensions) provides built-in instrumentation that enables you to monitor and manage the Java VM. To enable monitoring and management from remote systems, you need to set the system property com.sun.management.jmxremote.port (the port number through which you want to enable JMX RMI connections) when you start the Java VM.
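A hedged hbase-env.sh sketch of enabling remote JMX (the port numbers are placeholders; the disabled SSL and authentication settings are for illustration only and should be tightened for production):
export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"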
Historically, besides the port mentioned above, JMX opens two additional random TCP listening ports, which could lead to port conflict problems. Only a subset of all configurations can currently be changed in the running server. Summary • A patch upgrade is a drop-in replacement. Any change that is not Java binary and source compatible would not be allowed. Downgrading versions within patch releases may not be compatible. • A minor upgrade requires no application/client code modification.
Ideally it would be a drop-in replacement but client code, coprocessors, filters, etc might have to be recompiled if new jars are used. • A major upgrade allows the HBase community to make breaking changes.
Compatibility Matrix (Major / Minor / Patch):
Client-Server wire Compatibility: N / Y / Y
Server-Server Compatibility: N / Y / Y
File Format Compatibility: N / Y / Y
Client API Compatibility: N / Y / Y
Client Binary Compatibility: N / N / Y
Server-Side Limited API Compatibility (Stable): N / Y / Y
Server-Side Limited API Compatibility (Evolving): N / N / Y
Server-Side Limited API Compatibility (Unstable): N / N / N
Dependency Compatibility: N / Y / Y
Operational Compatibility: N / N / Y
• Public: safe for end users and external projects • LimitedPrivate: used for internals we expect to be pluggable, such as coprocessors • Private: strictly for use within HBase itself Classes which are defined as IA.Private may be used as parameters or return values for interfaces which are declared IA.LimitedPrivate. Treat the IA.Private object as opaque; do not try to access its methods or fields directly.
• InterfaceStability: describes what types of interface changes are permitted. Possible values include Stable, Evolving, and Unstable. Odd/Even Versioning or 'Development' Series Releases Ahead of big releases, we have been putting up preview versions to get the feedback cycle turning over earlier. These 'Development' Series releases, always odd-numbered, come with no guarantees, not even regarding being able to upgrade between two sequential releases (we reserve the right to break compatibility across 'Development' Series releases). Needless to say, these releases are not for production deploys. They are a preview of what is coming in the hope that interested parties will take the release for a test drive and flag us early if there are issues we’ve missed ahead of our rolling a production-worthy release. Binary Compatibility When we say two HBase versions are compatible, we mean that the versions are wire and binary compatible. Compatible HBase versions means that clients can talk to compatible but differently versioned servers. It means too that you can just swap out the jars of one version and replace them with the jars of another, compatible version and all will just work. Unless otherwise specified, HBase point versions are (mostly) binary compatible. You can safely do rolling upgrades between binary compatible versions; i.e. across point versions, e.g. from 0.94.5 to 0.94.6. See the 'Does compatibility between versions also mean binary compatibility?' discussion on the HBase dev mailing list.
Rollback vs Downgrade This section describes how to perform a rollback on an upgrade between HBase minor and major versions. In this document, rollback refers to the process of taking an upgraded cluster and restoring it to the old version while losing all changes that have occurred since upgrade. By contrast, a cluster downgrade would restore an upgraded cluster to the old version while maintaining any data written since the upgrade. We currently only offer instructions to rollback HBase clusters.
Further, rollback only works when these instructions are followed prior to performing the upgrade. Replication Unless you are doing an all-service rollback, the HBase cluster will lose any configured peers for HBase replication. If your cluster is configured for HBase replication, then prior to following these instructions you should document all replication peers.
After performing the rollback you should then add each documented peer back to the cluster. For more information on enabling HBase replication, listing peers, and adding a peer, see the replication section of this guide. Note also that data written to the cluster since the upgrade may or may not have already been replicated to any peers. Determining which, if any, peers have seen replication data as well as rolling back the data in those peers is out of the scope of this guide. Configurable Locations The instructions below assume default locations for the HBase data directory and the HBase znode. Both of these locations are configurable and you should verify the value used in your cluster before proceeding.
In the event that you have a different value, just replace the default with the one found in your configuration.
• The HBase data directory is configured via the key 'hbase.rootdir' and has a default value of '/hbase'.
• The HBase znode is configured via the key 'zookeeper.znode.parent' and has a default value of '/hbase'.
HBase Default Ports Changed The ports used by HBase changed. They used to be in the 600XX range. In HBase 1.0.0 they have been moved up out of the ephemeral port range and are 160XX instead (Master web UI was 60010 and is now 16010; the RegionServer web UI was 60030 and is now 16030, etc.). If you want to keep the old port locations, copy the port setting configs from hbase-default.xml into hbase-site.xml, change them back to the old values from the HBase 0.98.x era, and ensure you’ve distributed your configurations before you restart. The hbase.bucketcache.percentage.in.combinedcache configuration has been REMOVED You may have made use of this configuration if you are using BucketCache. If NOT using BucketCache, this change does not affect you.
Its removal means that your L1 LruBlockCache is now sized using hfile.block.cache.size — i.e. the way you would size the on-heap L1 LruBlockCache if you were NOT doing BucketCache — and the BucketCache size is now whatever the setting for hbase.bucketcache.size is. You may need to adjust configs to get the LruBlockCache and BucketCache sizes set to what they were in 0.98.x and previous. If you did not set this config, its default value was 0.9. If you do nothing, your BucketCache will increase in size by 10%. Your L1 LruBlockCache will become hfile.block.cache.size times your Java heap size (hfile.block.cache.size is a float between 0.0 and 1.0). Mismatch Of hbase.client.scanner.max.result.size Between Client and Server If either the client or server version is lower than 0.98.11/1.0.0 and the server has a smaller value for hbase.client.scanner.max.result.size than the client, scan requests that reach the server’s hbase.client.scanner.max.result.size are likely to miss data.
In particular, 0.98.11 defaults hbase.client.scanner.max.result.size to 2 MB but other versions default to larger values. For this reason, be very careful using 0.98.11 servers with any other client version. Tables Processed: hdfs://localhost:41020/myHBase/.META. • Namespaces: HBase 0.96.0 has support for namespaces. The upgrade needs to reorder directories in the filesystem for namespaces to work.
• ZNodes: All znodes are purged so that new ones can be written in their place using a new protobuf’ed format, and a few are migrated in place: e.g. replication and table state znodes. • WAL Log Splitting: If the 0.94.x cluster shutdown was not clean, we’ll split WAL logs as part of migration before we start up on 0.96.0. This WAL splitting runs slower than the native distributed WAL splitting because it is all inside the single upgrade process (so try and get a clean shutdown of the 0.94.0 cluster if you can). You can’t go back! To move to 0.92.0, all you need to do is shut down your cluster, replace your HBase 0.90.x with HBase 0.92.0 binaries (be sure you clear out all 0.90.x instances) and restart (you cannot do a rolling restart from 0.90.x to 0.92.x — you must restart). On startup, the .META. table content is rewritten, removing the table schema from the info:regioninfo column.
Also, any flushes done post first startup will write out data in the new 0.92.0 file format, HFile v2. This means you cannot go back to 0.90.x once you’ve started HBase 0.92.0 over your HBase data directory. MSLAB is ON by default In 0.92.0, the hbase.hregion.memstore.mslab.enabled flag is set to true. In 0.90.x it was false. When it is enabled, memstores will step-allocate memory in 2MB MSLAB chunks even if the memstore has zero or just a few small elements. This is fine usually, but if you had lots of regions per RegionServer in a 0.90.x cluster (and MSLAB was off), you may find yourself OOME’ing on upgrade because thousands of regions * number of column families * 2MB MSLAB (at a minimum) puts your heap over the top. Set hbase.hregion.memstore.mslab.enabled to false or set the MSLAB size down from 2MB by setting hbase.hregion.memstore.mslab.chunksize to something less.
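A sketch of those two mitigations in hbase-site.xml (pick one; the 1 MB chunk size is only an illustrative value, not a recommendation from this guide):
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>false</value>
</property>
<!-- or, keep MSLAB but shrink the chunk size -->
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>1048576</value>
</property>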
Memory accounting is different now In 0.92.0, indices and bloom filters take up residence in the same LRU used for caching blocks that come from the filesystem. In 0.90.x, the HFile v1 indices lived outside of the LRU, so they took up space even if the index was on a ‘cold’ file, one that wasn’t being actively used. With the indices now in the LRU, you may find you have less space for block caching. Adjust your block cache accordingly. The block cache default size has been changed in 0.92.0 from 0.2 (20 percent of heap) to 0.25. Changes in HBase replication 0.92.0 adds two new features: multi-slave and multi-master replication. The way to enable this is the same as adding a new peer, so in order to have multi-master you would just run add_peer for each cluster that acts as a master to the other slave clusters. Collisions are handled at the timestamp level, which may or may not be what you want; this needs to be evaluated on a per-use-case basis.
Replication is still experimental in 0.92 and is disabled by default; run it at your own risk. HFile v2 and the “Bigger, Fewer” Tendency 0.92.0 stores data in a new format, HFile v2. As HBase runs, it will move all your data from HFile v1 to HFile v2 format. This auto-migration will run in the background as flushes and compactions run. HFile v2 allows HBase to run with larger regions/files. In fact, we encourage all HBasers going forward to tend toward Facebook axiom #1: run with larger, fewer regions. If you have lots of regions now — more than 100s per host — you should look into setting your region size up after you move to 0.92.0 (in 0.92.0, the default size is now 1G, up from 256M), and then running the online merge tool. Getting an exit code of 0 means that the command you scripted definitely succeeded.
However, getting a non-zero exit code does not necessarily mean the command failed. The command could have succeeded, but the client lost connectivity, or some other event obscured its success. This is because RPC commands are stateless.
The only way to be sure of the status of an operation is to check. For instance, if your script creates a table, but returns a non-zero exit value, you should check whether the table was actually created before trying again to create it.
$ ./hbase shell ./sample_commands.txt
0 row(s) in 3.4170 seconds
TABLE
test
1 row(s) in 0.0590 seconds
0 row(s) in 0.1540 seconds
0 row(s) in 0.0080 seconds
0 row(s) in 0.0060 seconds
0 row(s) in 0.0060 seconds
ROW                  COLUMN+CELL
 row1                column=cf:a, timestamp=968, value=value1
 row2                column=cf:b, timestamp=997, value=value2
 row3                column=cf:c, timestamp=007, value=value3
 row4                column=cf:d, timestamp=015, value=value4
4 row(s) in 0.0420 seconds
COLUMN               CELL
 cf:a                timestamp=968, value=value1
1 row(s) in 0.0110 seconds
0 row(s) in 1.5630 seconds
0 row(s) in 0.4360 seconds
# generate splits for long (Ruby fixnum) key range from start to end key
hbase(main):070:0> def gen_splits(start_key,end_key,num_regions)
hbase(main):071:1>   results=[]
hbase(main):072:1>   range=end_key-start_key
hbase(main):073:1>   incr=(range/num_regions).floor
hbase(main):074:1>   for i in 1 .. num_regions-1
hbase(main):075:2>     results.push([i*incr+start_key].pack("N"))
hbase(main):076:2>   end
hbase(main):077:1>   return results
hbase(main):078:1> end
hbase(main):079:0>
hbase(main):080:0> splits=gen_splits(1, 2000000, 10)
=> ["\000\003\r@", "\000\006\032\177", ...]
hbase(main):081:0> create 'test_splits', 'f', SPLITS=>splits
0 row(s) in 0.2670 seconds
=> Hbase::Table - test_splits
HBase Data Model Terminology
Table An HBase table consists of multiple rows.
Row A row in HBase consists of a row key and one or more columns with values associated with them. Rows are sorted alphabetically by the row key as they are stored.
For this reason, the design of the row key is very important. The goal is to store data in such a way that related rows are near each other.
A common row key pattern is a website domain. If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain. Column A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.
Column Family Column families physically colocate a set of columns and their values, often for performance reasons. Each column family has a set of storage properties, such as whether its values should be cached in memory, how its data is compressed or its row keys are encoded, and others.
Each row in a table has the same column families, though a given row might not store anything in a given column family. Column Qualifier A column qualifier is added to a column family to provide the index for a given piece of data. Given a column family content, a column qualifier might be content:html, and another might be content:pdf. Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows. Cell A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value’s version. Timestamp A timestamp is written alongside each value, and is the identifier for a given version of a value.
By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell. The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people. In this example, for the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html). This example contains 5 versions of the row with the row key com.cnn.www, and one version of the row with the row key com.example.www. The contents:html column qualifier contains the entire HTML of a given website.
Qualifiers of the anchor column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link. The people column family represents people associated with the site. By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is made up of the column family contents and the html qualifier. The colon character (:) delimits the column family from the column family qualifier.
Table webtable (Row Key | Time Stamp | ColumnFamily contents | ColumnFamily anchor | ColumnFamily people):
'com.cnn.www' | t9 | | anchor:cnnsi.com = 'CNN' |
'com.cnn.www' | t8 | | anchor:my.look.ca = 'CNN.com' |
'com.cnn.www' | t6 | contents:html = '' | |
'com.cnn.www' | t5 | contents:html = '' | |
'com.cnn.www' | t3 | contents:html = '' | |
'com.example.www' | t5 | contents:html = '' | | people:author = 'John Doe'
Although at a conceptual level tables may be viewed as a sparse set of rows, they are physically stored by column family.
A new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.
ColumnFamily anchor (Row Key | Time Stamp | ColumnFamily anchor):
'com.cnn.www' | t9 | anchor:cnnsi.com = 'CNN'
'com.cnn.www' | t8 | anchor:my.look.ca = 'CNN.com'
Table 7. ColumnFamily contents (Row Key | Time Stamp | ColumnFamily contents):
'com.cnn.www' | t6 | contents:html = ''
'com.cnn.www' | t5 | contents:html = ''
'com.cnn.www' | t3 | contents:html = ''
The empty cells shown in the conceptual view are not stored at all.
Thus a request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column would be returned. Given multiple versions, the most recent is also the first one found, since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www if no timestamp is specified would be: the value of contents:html from timestamp t6, the value of anchor:cnnsi.com from timestamp t9, the value of anchor:my.look.ca from timestamp t8. Columns in Apache HBase are grouped into column families. All column members of a column family have the same prefix.
For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running. Deletes work by creating tombstone markers. For example, let’s suppose we want to delete a row.
For this you can specify a version, or else by default the currentTimeMillis is used. What this means is delete all cells where the version is less than or equal to this version.
HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values.
When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted. Delete markers are purged during the next major compaction of the store, unless the KEEP_DELETED_CELLS option is set in the column family. To keep the deletes for a configurable amount of time, you can set the delete TTL via the hbase.hstore.time.to.purge.deletes property in hbase-site.xml. If hbase.hstore.time.to.purge.deletes is not set, or set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction.
Otherwise, a delete marker with a timestamp in the future is kept until the major compaction which occurs after the time represented by the marker’s timestamp plus the value of hbase.hstore.time.to.purge.deletes, in milliseconds. Deletes mask puts, even puts that happened after the delete was entered.
Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything ⇐ T. After this you do a new put with a timestamp ⇐ T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run.
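A minimal Java client sketch of the scenario above (the table, family, and qualifier names are hypothetical; this only illustrates the masking behavior described in the text):
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteMasksPut {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      byte[] row = Bytes.toBytes("row1"), cf = Bytes.toBytes("cf"), q = Bytes.toBytes("q");
      long t = System.currentTimeMillis();
      Delete d = new Delete(row);
      d.addColumns(cf, q, t);              // tombstone covering all versions <= t
      table.delete(d);
      Put p = new Put(row, t - 1);         // put with a timestamp earlier than the tombstone
      p.addColumn(cf, q, Bytes.toBytes("v"));
      table.put(p);                        // succeeds, but ...
      Result r = table.get(new Get(row));  // ... the value is masked by the tombstone
      System.out.println("cell visible? " + !r.isEmpty());   // prints false
    }
  }
}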
These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and a put immediately after each other, and there is some chance they happen within the same millisecond. However, that doesn’t mean that equivalent join functionality can’t be supported in your application, but you have to do it yourself. The two primary strategies are either to denormalize the data upon writing to HBase, or to have lookup tables and do the join between HBase tables in your application or MapReduce code (and as RDBMSs demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs. hash joins). So which is the best approach? It depends on what you are trying to do, and as such there isn’t a single answer that works for every use case.
• Aim to have regions sized between 10 and 50 GB. • Aim to have cells no larger than 10 MB, or 50 MB if you use mob storage. Otherwise, consider storing your cell data in HDFS and storing a pointer to the data in HBase. • A typical schema has between 1 and 3 column families per table. HBase tables should not be designed to mimic RDBMS tables.
• Around 50-100 regions is a good number for a table with 1 or 2 column families. Remember that a region is a contiguous segment of a column family. • Keep your column family names as short as possible. The column family names are stored for every value (ignoring prefix encoding).
They should not be self-documenting and descriptive like in a typical RDBMS. • If you are storing time-based machine data or logging information, and the row key is based on device ID or service ID plus time, you can end up with a pattern where older data regions never have additional writes beyond a certain age. In this type of situation, you end up with a small number of active regions and a large number of older regions which have no new writes. For these situations, you can tolerate a larger number of regions because your resource consumption is driven by the active regions only. • If only one column family is busy with writes, only that column family accumulates memory. Be aware of write patterns when allocating resources.
RegionServer Sizing Rules of Thumb. HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When many column families exist the flushing and compaction interaction can make for a bunch of needless i/o (To be addressed by changing flushing and compaction to work on a per column family basis). For more information on compactions, see.
Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other.
However, poorly designed row keys are a common source of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability. This can also have adverse effects on other regions hosted by the same region server as that host is unable to service the requested load.
It is important to design data access patterns such that the cluster is fully and evenly utilized. Salting Salting in this sense has nothing to do with cryptography, but refers to adding random data to the start of a row key. In this case, salting refers to adding a randomly-assigned prefix to the row key to cause it to sort differently than it otherwise would.
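A hypothetical sketch of a salting helper (the two-digit prefix format and the fixed number of buckets are assumptions for illustration, not part of HBase itself):
import java.util.concurrent.ThreadLocalRandom;
import org.apache.hadoop.hbase.util.Bytes;

public final class Salt {
  // Prepend one of `buckets` randomly chosen prefixes so that writes with
  // otherwise-similar keys spread across `buckets` region ranges; readers
  // must try every prefix to locate a row, which is the read-side cost.
  public static byte[] salted(byte[] rowKey, int buckets) {
    int bucket = ThreadLocalRandom.current().nextInt(buckets);
    return Bytes.add(Bytes.toBytes(String.format("%02d-", bucket)), rowKey);
  }
}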
The number of possible prefixes corresponds to the number of regions you want to spread the data across. Salting can be helpful if you have a few 'hot' row key patterns which come up over and over amongst other more evenly-distributed rows. As the sketch above suggests, salting can spread write load across multiple RegionServers, but it has negative implications for reads, since every read must check all possible prefixes. In the HBase chapter of Tom White’s book (O’Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table’s regions (and thus, a single node), then moving onto the next region, etc.
With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores.
The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general it’s best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key. If you do need to upload time series data into HBase, you should study as a successful example. It has a page describing the it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it will be accompanied by its row, column name, and timestamp, always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of (recommended!). Therein, the indices that are kept on HBase storefiles to facilitate random access may end up occupying large chunks of the HBase-allotted RAM because the cell value coordinates are large. Marc, in the above-cited comment, suggests upping the block size so entries in the store file index happen at a larger interval, or modifying the table schema so it makes for smaller rows and column names. Compression will also make for larger indices.
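A sketch of that block-size suggestion using the Java admin API (the family name and the 128 KB value are assumptions for illustration):

import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class WideCellSchema {
  // A larger block size means store file index entries happen at a larger
  // interval, so the index itself stays smaller.
  static ColumnFamilyDescriptor wideValueFamily() {
    return ColumnFamilyDescriptorBuilder
        .newBuilder(Bytes.toBytes("d"))   // family name is an assumption
        .setBlocksize(128 * 1024)         // up from the 64 KB default
        .build();
  }
}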
See the thread up on the user mailing list. If you pre-split your table, it is critical to understand how your rowkey will be distributed across the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., '0000000000000000' to 'ffffffffffffffff'). Running those key ranges through Bytes.split (which is the split strategy used when creating regions in Admin.createTable(byte[] startKey, byte[] endKey, numRegions)) for 10 regions will generate the following splits:

48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 // 0
54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 // 6
61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68 // =
68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126 // D
75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72 // K
82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14 // R
88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44 // X
95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102 // _
102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 // f

The problem is that all the data is going to pile up in the first 2 regions and the last region, thus creating a 'lumpy' (and possibly 'hot') region problem.
To understand why, refer to an ASCII table: '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., not relying on the built-in split method) is required.
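A minimal sketch of such a custom pre-split for this hex keyspace, assuming a hypothetical table name and column family:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HexSplits {
  // 15 split points ('1' through 'f') yield 16 regions, one per hex digit that
  // can actually appear in the lead position of the key.
  static void createHexSplitTable(Admin admin) throws IOException {
    String hex = "0123456789abcdef";
    byte[][] splits = new byte[hex.length() - 1][];
    for (int i = 1; i < hex.length(); i++) {
      splits[i - 1] = Bytes.toBytes(hex.substring(i, i + 1));
    }
    admin.createTable(
        TableDescriptorBuilder.newBuilder(TableName.valueOf("hexkeys"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
            .build(),
        splits);
  }
}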
Like the maximum number of row versions, the minimum number of row versions to keep is configured per column family. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as 'keep the last T minutes worth of data, at most N versions, but keep at least M versions around' (where M is the value for minimum number of row versions, M < N).
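For example, a sketch of the 'T minutes, at most N, at least M' configuration via the Java admin API (the numbers and family name are illustrative):

import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class RetentionExample {
  static ColumnFamilyDescriptor retentionFamily() {
    return ColumnFamilyDescriptorBuilder
        .newBuilder(Bytes.toBytes("d"))
        .setTimeToLive(10 * 60)  // T: keep roughly the last 10 minutes of data
        .setMaxVersions(5)       // N: at most 5 versions
        .setMinVersions(2)       // M: but always keep at least 2 versions (M < N)
        .build();
  }
}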
HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is to enforce business rules for attributes in the table (e.g., make sure values are in the range 1-10). Constraints could also be used to enforce referential integrity, but this is strongly discouraged, as it will dramatically decrease the write throughput of the tables where integrity checking is enabled. Extensive documentation on using Constraints has been available since version 0.94.

Pros of storing complete objects as serialized BLOBs are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.
My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we’ll always need the same page size. I’ve ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case.
I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we’d need to update all subsequent rows). Your two options mirror a common question people have when designing HBase schemas: should I go 'tall' or 'wide'? Your first schema is 'tall': each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means 'the value'. This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done.
What you’re giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn’t sound like you need that. Doing it this way is generally recommended (see here). Your second option is 'wide': you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row.
I’m guessing you jumped to the 'paginated' version because you’re assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you’re not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn’t be fundamentally worse. The client has methods that allow you to get specific slices of columns. A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don’t have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value).
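Should you go wide, the column-slice methods mentioned above can page through a row server-side. A minimal sketch, assuming a hypothetical table handle, family name, and page size:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowPage {
  // Return up to 'limit' column values for one user, starting at column 'offset'.
  static Result page(Table table, String user, int limit, int offset) throws IOException {
    Get get = new Get(Bytes.toBytes(user));
    get.addFamily(Bytes.toBytes("v"));                        // family name is an assumption
    get.setFilter(new ColumnPaginationFilter(limit, offset)); // server-side slice of the row
    return table.get(get);
  }
}

Either way, measuring with your own access pattern is the only reliable tie-breaker.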
Start simple and iterate! :)

• -XX:CMSInitiatingOccupancyFraction=70
• Optimize for low collection latency rather than throughput: -Xmn512m
• Collect eden in parallel: -XX:+UseParNewGC
• Avoid collection under pressure: -XX:+UseCMSInitiatingOccupancyOnly
• Limit per-request scanner result sizing so everything fits into survivor space but doesn’t tenure. In hbase-site.xml, set hbase.client.scanner.max.result.size to 1/8th of eden space (with -Xmn512m this is ~51MB).
• Set max.result.size x handler.count less than survivor space.

HBase timeline consistency (HBASE-10070). With read replicas enabled, read-only copies of regions (replicas) are distributed over the cluster. One RegionServer services the default or primary replica, which is the only replica that can service writes. Other RegionServers serve the secondary replicas, follow the primary RegionServer, and only see committed updates. The secondary replicas are read-only, but can serve reads immediately while the primary is failing over, cutting read availability blips from seconds to milliseconds.
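A sketch of opting into timeline consistency on a single read (the table handle and row key are illustrative); the result flags whether it was served by a secondary replica:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class TimelineRead {
  static Result timelineGet(Table table, byte[] row) throws IOException {
    Get get = new Get(row);
    get.setConsistency(Consistency.TIMELINE);  // allow a secondary replica to answer
    Result result = table.get(get);
    if (result.isStale()) {
      // Served by a secondary replica; the data may slightly lag the primary.
    }
    return result;
  }
}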
Phoenix supports timeline consistency as of version 4.4.0.

Tips. There are two mapreduce packages in HBase, as in MapReduce itself: org.apache.hadoop.hbase.mapred and org.apache.hadoop.hbase.mapreduce. The former uses the old-style API and the latter the new mode.
The latter has more facility, though you can usually find an equivalent in the older package. Pick the package that goes with your MapReduce deploy. When in doubt or starting over, pick org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using. To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add the HBase jars to the $HADOOP_HOME/lib directory.
You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add hbase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references.
It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data. Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The dependencies only need to be available on the local CLASSPATH, and from there they will be picked up and bundled into the fat job jar deployed to the MapReduce cluster. A basic trick just passes the full hbase classpath (all hbase and dependent jars as well as configurations) to the mapreduce job runner, letting the hbase utility pick out from the full-on classpath what it needs and add those to the MapReduce job configuration (see the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done). The following example runs the bundled HBase MapReduce job against a table named usertable.
It sets into HADOOP_CLASSPATH the jars HBase needs to run in a MapReduce context (including configuration files such as hbase-site.xml). Be sure to use the correct version of the HBase JAR for your system; replace the VERSION string in the below command line with the version of your local hbase install. The backticks (` symbols) cause the shell to execute the sub-commands, setting the output of hbase classpath into HADOOP_CLASSPATH. This example assumes you use a BASH-compatible shell.
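In code, the equivalent wiring is done by TableMapReduceUtil. A minimal read-job sketch under assumed names (job name, mapper, output types); the dependency jars are added to the job configuration for you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class UserTableDriver {
  // Emit one (rowkey, cell count) pair per row of the input table.
  public static class RowSizeMapper extends TableMapper<Text, IntWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(new Text(Bytes.toStringBinary(row.copyBytes())), new IntWritable(value.size()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "usertable-scan");
    job.setJarByClass(UserTableDriver.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows returned per RPC; see the caching discussion below
    scan.setCacheBlocks(false);  // full scans should not churn the RegionServer block cache

    TableMapReduceUtil.initTableMapperJob(
        "usertable", scan, RowSizeMapper.class, Text.class, IntWritable.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}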
Optimizing the caching settings is a balance between the time the client waits for a result and the number of sets of results the client needs to receive. If the caching setting is too large, the client could end up waiting for a long time or the request could even time out. If the setting is too small, the scan needs to return results in several pieces. If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the bucket. When you read from HBase, the TableInputFormat requests the list of regions from HBase and makes a map task per region, or mapreduce.job.maps map tasks, whichever is smaller. If your job only has two maps, raise mapreduce.job.maps to a number greater than the number of regions. Maps will run on the adjacent TaskTracker/NodeManager if you are running a TaskTracker/NodeManager and RegionServer per node.
When writing to HBase, it may make sense to avoid the Reduce step and write back into HBase from within your map. This approach works when your job does not need the sort and collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is no point double-sorting (and shuffling data around your MapReduce cluster) unless you need to. If you do not need the Reduce, your map might emit counts of records processed for reporting at the end of the job, or set the number of Reduces to zero and use TableOutputFormat. If running the Reduce step makes sense in your case, you should typically use multiple reducers so that load is spread across the HBase cluster.
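A sketch of the zero-reduce pattern under assumed table and class names; passing a null reducer class to initTableReducerJob wires up TableOutputFormat so the map can emit Puts directly:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class CopyTableSketch {
  // Write one marker cell per source row into the target table, straight from the map.
  public static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(row.copyBytes());
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("copied"), Bytes.toBytes(1L));
      context.write(row, put);
    }
  }

  static void configure(Job job) throws IOException {
    TableMapReduceUtil.initTableMapperJob("source", new Scan(), CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    TableMapReduceUtil.initTableReducerJob("target", null, job);  // sets up TableOutputFormat
    job.setNumReduceTasks(0);  // map output goes straight to HBase, no sort or shuffle
  }
}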
• hbase.security.authentication.spnego.kerberos.principal (e.g., HTTP/_HOST@EXAMPLE.COM): required for SPNEGO; the Kerberos principal to use for SPNEGO authentication by the web server. The _HOST keyword will be automatically substituted with the node's hostname.
• hbase.security.authentication.spnego.kerberos.keytab (e.g., /etc/security/keytabs/spnego.service.keytab): required for SPNEGO; the Kerberos keytab file to use for SPNEGO authentication by the web server.
• hbase.security.authentication.spnego.kerberos.name.rules: optional; Hadoop-style `auth_to_local` rules which will be parsed and used in the handling of Kerberos principals.
• hbase.security.authentication.signature.secret.file: optional; a file whose contents will be used as a secret to sign the HTTP cookies as part of the SPNEGO authentication handshake. If this is not provided, Java's `Random` library will be used for the secret.

For the Thrift gateway: hbase.thrift.keytab.file = /etc/hbase/conf/hbase.keytab, hbase.thrift.kerberos.principal = $USER/_HOST@HADOOP.LOCALDOMAIN, hbase.thrift.dns.interface = default, hbase.thrift.dns.nameserver = default.

For the REST gateway: hbase.rest.support.proxyuser = true, hbase.rest.authentication.type = kerberos, hbase.rest.authentication.kerberos.principal = HTTP/_HOST@HADOOP.LOCALDOMAIN, hbase.rest.authentication.kerberos.keytab = $KEYTAB, hbase.rest.dns.interface = default, hbase.rest.dns.nameserver = default.
• hbase.security.authentication = simple
• hbase.security.authorization = true
• hbase.coprocessor.master.classes = org.apache.hadoop.hbase.security.access.AccessController
• hbase.coprocessor.region.classes = org.apache.hadoop.hbase.security.access.AccessController
• hbase.coprocessor.regionserver.classes = org.apache.hadoop.hbase.security.access.AccessController

All of the data under management is kept under the root directory in the file system (hbase.rootdir).
Access to the data and WAL files in the filesystem should be restricted so that users cannot bypass the HBase layer and peek at the underlying data files from the file system. HBase assumes the filesystem used (HDFS or other) enforces permissions hierarchically. If sufficient protection from the file system (both authorization and authentication) is not provided, HBase-level authorization control (ACLs, visibility labels, etc.) is meaningless, since the user can always access the data from the file system. HBase enforces the POSIX-like permissions 700 (rwx------) on its root directory. This means that only the HBase user can read or write the files in the filesystem. The default setting can be changed by configuring hbase.rootdir.perms in hbase-site.xml.
A restart of the active master is needed so that it applies the new permissions. For versions before 1.2.0, you can check whether HBASE-13780 is committed, and if not, you can manually set the permissions for the root directory if needed. Using HDFS, the command would be. In secure mode, SecureBulkLoadEndpoint should be configured and used for properly handing off files created by users from MR jobs to the HBase daemons and the HBase user. The staging directory in the distributed file system used for bulk load (hbase.bulkload.staging.dir, defaults to /tmp/hbase-staging) should have mode 711 (rwx--x--x) so that users can access the staging directory created under that parent directory, but cannot do any other operation.
See for how to configure SecureBulkLoadEndPoint.