Changing Strings To Integers The use of Hash, HashCode, And CRC-32 In ColdFusion

Changing Strings To Integers The use of Hash, HashCode, And CRC-32 In ColdFusion

[ad_1]

In my Characteristic Flags ebook, I had a snippet of code that makes use of the CRC-32 checksum as a way to map strings to integers for person concentrated on. In his overview of the ebook, Adam Cameron steered that my code used to be longer than it had to be; and, that I would possibly use the hash() serve as for brevity. This gave me pause; particularly in gentle of James Moberg’s fresh publish on the use of .hashCode(). I need to take a second and step again and take into consideration how and why I am changing strings to integers in ColdFusion.

The underlying mechanism of mapping a string onto an integer is a hashing serve as. A hashing serve as is any serve as that takes an enter of arbitrary duration and decreases it down an output with a predictable duration. There are lots of other hashing algorithms, each and every of which is more-or-less appropriate for any specific drawback area. The CRC-32 checksum, MD5 hash, and Java’s String.hashCode() are all examples of hashing algorithms.

Within the characteristic flags drawback area, mapping a string onto an integer is essential when distributing a variant cost the use of a percent-based rollout. As an example, for those who sought after to permit a characteristic flag for five% of customers, your characteristic flags SDK (Device Developer Equipment) would want a method to decide if a given request falls into that 5% cohort or into the rest 95% cohort. It does this through mapping the request key—which is a string—onto a numeric share (1-100).

This mapping of strings onto integers will have to be constant throughout requests and throughout software reboots. If the mapping had been inconsistent, a characteristic flag could be enabled for a person in a single request after which mysteriously disabled for a similar person in a next request. This unpredictability would undermine the value-add of characteristic flags.

Moreover, the mapping of strings onto integers must be lightly dispensed on stability. Which means that after I map a person’s request key onto the 1-100 share integer, each and every cost must more or less map to at least one% of the total user-base.

All to mention, within the context of characteristic flags, I do no longer want a hashing serve as this is cryptographically safe. I best want a hashing serve as this is speedy, constant, and lightly dispensed.

To mess around with other mapping purposes, I created an Encoder.cfc ColdFusion part that gives 3 strategies for mapping arbitrary strings onto sure integers. Every manner makes use of a special hashing set of rules: CRC-32, MD5, String.hashCode():

part
	output = false
	trace = "I supply quite a lot of strategies for encoding a string as an integer."
	{

	/**
	* I initialize the encoder.
	*/
	public void serve as init() {

		// Application values for use in strategies.
		variables.BigIntClass = createObject( "java", "java.math.BigInteger" );
		variables.maxBigInt = BigIntClass.init( 2147483647 );

	}

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I encode the given enter the use of the Java local hashCode().
	*/
	public numeric serve as usingHashCode( required string enter ) {

		go back( abs( javaCast( "string", enter ).hashCode() ) );

	}


	/**
	* I encode the given enter the use of the ColdFusion local hash().
	*/
	public numeric serve as usingHash( required string enter ) {

		go back BigIntClass
			.init( hash( enter ), 16 )
			.mod( maxBigInt )
			.intValue()
		;

	}


	/**
	* I encode the given enter the use of a CRC-32 checksum.
	*/
	public numeric serve as usingCRC( required string enter ) {

		var checksum = createObject( "java", "java.util.zip.CRC32" )
			.init()
		;
		checksum.replace( charsetDecode( enter, "utf-8" ) );

		go back BigIntClass
			.valueOf( checksum.getValue() )
			.mod( maxBigInt )
			.intValue()
		;

	}

}

Since one of the hashing algorithms lead to a worth that can not be are compatible right into a 32-bit integer, I am the use of the modulo operator (on this case, .mod() manner on BigInteger) to additional scale back the price right down to an integer.

The CRC-32 and MD5 algorithms are constant through design. The String.hashCode() manner is a little more contentious. Within the Java documentation of Object.hashCode(), it says that:

On every occasion it’s invoked at the identical object greater than as soon as all through an execution of a Java software, the hashCode manner will have to constantly go back the similar integer, equipped no knowledge utilized in equals comparisons at the object is changed. This integer don’t need to stay constant from one execution of an software to any other execution of the similar software.

Alternatively, for those who take a look at the overridden manner within the String magnificence, it explicitly defines the set of rules for hashing the string:

Returns a hash code for this string. The hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

What I take this to imply is that whilst the extra generic Object.hashCode() is unpredictable throughout techniques, String.hashCode() is predictable throughout techniques. And, actually, that is what James Moberg noticed in his exploration as smartly (see previous hyperlink).

Assuming constant mapping, I am nonetheless curious to discover velocity and distribution. For velocity, I will see how repeatedly I will be able to execute each and every set of rules in 3 seconds. For the enter, I am the use of a definite cost that looks as if an IP cope with (with the intention to extra intently reflect what I would possibly use in an actual global software):

<cfscript>

	encoder = new Encoder();
	maxDuration = 3000;

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	depend = runFor(
		( i ) => {

			go back( encoder.usingHashCode( "#i#.192.#i#.168.#i#" ) );

		},
		maxDuration
	);

	echoLine( "HashCode: #numberFormat( depend )#" );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	depend = runFor(
		( i ) => {

			go back( encoder.usingHash( "#i#.192.#i#.168.#i#" ) );

		},
		maxDuration
	);

	echoLine( "Hash: #numberFormat( depend )#" );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	depend = runFor(
		( i ) => {

			go back( encoder.usingCRC( "#i#.192.#i#.168.#i#" ) );

		},
		maxDuration
	);

	echoLine( "CRC-32: #numberFormat( depend )#" );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I run the given operator as repeatedly as I will be able to within the given length.
	*/
	public numeric serve as runFor(
		required serve as operator,
		required numeric durationInMilliseconds
		) {

		var cutoffAt = ( getTickCount() + durationInMilliseconds );
		var executionCount = 0;

		whilst ( getTickCount() < cutoffAt ) {

			operator( ++executionCount );

		}

		go back( executionCount );

	}


	/**
	* I output the given cost adopted through a BR tag.
	*/
	public void serve as echoLine( string cost = "" ) {

		writeOutput( cost & "<br />" );

	}

</cfscript>

If I run this in Lucee CFML, I am getting the next output:

  • HashCode: 18,981 (oddly gradual!)
  • Hash: 166,875
  • CRC-32: 516,026

And, if I run this in Adobe ColdFusion, I am getting the next output:

  • HashCode: 645,021
  • Hash: 414,450
  • CRC-32: 265,481

It is abnormal that the 2 other CFML engines have inverse velocity results. This may well be one thing in the best way I am calling the code; or, it may well be one thing within the engine implementations.

It is specifically abnormal how a lot slower String.hashCode() appears to be in Lucee CFML for my specific inputs. It is a Java-layer manner name; and, either one of my native CommandBox servers seem to be the use of Java 11.0.21 (Homebrew) 64bit. As such, I’ve to consider that there is something funky in my code this is simply no longer obtrusive.

That mentioned, those strategies are all beautiful speedy (bar that one oddity). So, let’s take a look at the distribution of inputs onto outputs. As I discussed above, in a characteristic flags context, we wish to map person key values onto 1-100 share buckets. As such, I will use the modulo operator (%) to additional scale back each and every cost:

<cfscript>

	encoder = new Encoder();
	iterations = 100000;

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	buckets = [];
	arraySet( buckets, 1, 100, 0 );

	for ( i = 1 ; i <= iterations ; i++ ) {

		encodedInt = encoder.usingHashCode( "#i#.192.#i#.168.#i#" );
		bucketIndex = ( ( encodedInt % 100 ) + 1 );
		buckets[ bucketIndex ]++;

	}

	printBuckets( "The use of Hash Code", buckets );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	buckets = [];
	arraySet( buckets, 1, 100, 0 );

	for ( i = 1 ; i <= iterations ; i++ ) {

		encodedInt = encoder.usingHash( "#i#.192.#i#.168.#i#" );
		bucketIndex = ( ( encodedInt % 100 ) + 1 );
		buckets[ bucketIndex ]++;

	}

	printBuckets( "The use of Hash", buckets );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	buckets = [];
	arraySet( buckets, 1, 100, 0 );

	for ( i = 1 ; i <= iterations ; i++ ) {

		encodedInt = encoder.usingCRC( "#i#.192.#i#.168.#i#" );
		bucketIndex = ( ( encodedInt % 100 ) + 1 );
		buckets[ bucketIndex ]++;

	}

	printBuckets( "The use of CRC-32", buckets );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I output the buckets as a horizontal bar-chart.
	*/
	public void serve as printBuckets(
		required string name,
		required array buckets
		) {

		```
		<cfoutput>
			<div taste="go with the flow: left ; margin-right: 20px ;">
				<h2>
					#encodeForHtml( name )#
				</h2>
				<cfloop index="native.cost" array="#buckets#">
					<div
						taste="background: hotpink ; width: #( cost / 4 )#px ; peak: 4px ;">
					</div>
				</cfloop>
			</div>
		</cfoutput>
		```

	}

</cfscript>

This simply helps to keep a working tally of the way repeatedly each and every bucket index is generated over 100,000 iteration. After which, plots it as a horizontal bar chart:

As you’ll be able to see, each and every of the hashing algorithms offers us a “excellent sufficient” distribution, on stability.

That mentioned, I by chance spotted one thing in reality abnormal! When working the distribution code, if I modify the enter from:

"#i#.192.#i#.168.#i#"

to:

"192.#i#.168.#i#"

(realize that I have got rid of the main #i#.), it radically alters the distribution for the String.hashCode() execution:

Right here, it kind of feels that each and every different share bucket within the String.hashCode() set of rules is 0. In the meantime, the opposite two hashing algorithms proceed to have an lightly dispensed mapping.

I am not certain what’s going on right here; however, it will have to have one thing to do with the truth that the main characters on this latter execution are all the time the similar (192.); which will have to have an oversized affect at the general hash code era. I am satisfied to have by chance stumbled upon this abnormal result.

In any case, I’m going to most certainly simply keep on with the CRC-32 hashing set of rules as the instance within the ebook. The String.hashCode() would had been the nicest method to make use of as a result of it is so easy; however, I do not need to lead folks down the unsuitable trail. Nonetheless, it used to be excellent to step again and suppose extra deeply about why I am mapping strings to integers and the way I am opting for to do this.

Need to use code from this publish?
Take a look at the license.



[ad_2]

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Back To Top
0
Would love your thoughts, please comment.x
()
x