Adding Value to Data: The Power of Data Transformation for Improved Analytical Outcomes

MalBot · August 27, 2023, 1:05pm

Accessing abundant data, including metadata, plays a crucial role in accelerating law enforcement investigations and allowing authorities to quickly adapt to emerging situations and threats. However, the value of the insights gained is dependent on the data being properly cleansed, enriched, fused and analyzed. Applying different transformations through ETL (Extract-Transform-Load), ELT (Extract-Load-Transform), and Federated Search is critical to extracting high-quality analytical insights. But what do the transformations look like?

For example, July 4, 1976 (07/04/1976) is like any other date and can be correlated, referenced, or aggregated with other events or transactions happening that same day. Readers based in the US might recognize that it represents the 200^th anniversary of adopting the US Declaration of Independence. However, on an analytical level there is much more to this date (or any date). Performing basic transformations exposes new dimensions that analytical platforms use to identify specific meanings, patterns, and behaviors.

Known information about July 4, 1976:

Dimension	Value
Year	1976
Month of Year	July
Day of Month	4
Day of Week	Sunday
Week of Month	2
Week of Year	28
Day of Year	186
Quarter-Calendar	3
Quarter-Financial	4
Season	Summer
Holiday-US-Federal	Yes
Leap Year	Yes

Not only is July 4th a significant US national holiday, but other notable historical events also occurred on this day: West Point was officially opened in 1802, the Philippines gained independence from the US and became a sovereign state in 1946, and the American journalist Geraldo Rivera turned 33 years old (80 today). In other countries, dates and anniversaries are equally important and generally defined in simple look-up tables.

Furthermore, secondary observations can be derived from the data. Being a US national holiday and a Sunday, we know specific types of business establishments, including banks and financial institutions, will be closed. Therefore, a spike in certain activity (e.g., money transfers) on this day may be cause for further investigation. However, in the Middle East, Sunday is the first workday of the week, so different patterns would be expected.

Generally, transforms are categorized into three groups depending on their complexity:

Basic – Simple calculations, substitutions, and formatting (REGEX)

Uppercasing all text, trimming spaces, convert types
Formatting dates into YYYY/MM/DD or extracting day-of-week

Replace characters such as _AVENUE_ with _AVE_
Calculating ratios between values: col_x / col_y = col_z
Trimming US Zip code+4 to a five alphanumeric value
Rounding lat/longs to five decimal places

Intermediate – Custom algorithms, heuristics, and reference data

Performing social security or national ID validation checks (age/issuance/ITIN/death)
Simple checksum verification algorithms (Luhn / Verhoeff)
Running phonetic reduction/hashing algorithms (Eudex, Soundex, Ngrams)
Creating secure one-way cipher text hash functions (MD5, SHA-256)
Converting Zip or postal codes to lat/longs or military grid reference system (MGRS)

Calculating the distance between two lat/long points (Haversine Formula)

Advanced: Often utilize pre-trained machine learning models to define outputs

Creating relationships between people identified in a photograph
Defining the make/model, color, and year of a car from a picture
Identifying weapons, drugs, license plates, and other text/logos in pictures

Flagging stressful situations and content from body-worn camera videos
Parsing and normalizing street addresses into a standardized format
Natural Language Processing (NLP) to identify proper entities and relationships

When loading data, basic cleanups are often run on all the data to trim leading/trailing spaces, remove double spaces, delete special characters (periods, commas, quotes, dashes, etc.), and convert text to uppercase. These transformations are not mutually exclusive, so one transform’s output can be another’s input, and chaining the transforms into a workflow helps ensure consistency. This example shows the transforms applied to an address:

Input: 1600 Pennsylvania Avenue, N.W. Washington, D.C., 20500-0004

First Pass: 1600 PENNSYLVANIA AVENUE NW WASHINGTON DC 205000004

The process continues with standard replacements to help map values to a common denominator. In the example, the address contains the word “AVENUE,” and the United States Postal Service shows variations to include AV, AVE, AVEN, AVENU, AVN, AVNUE. These derivations should always return a defined value, such as AVE. The same goes for other entries such as STREET, PARKWAY, BOULEVARD, ROUTE, etc. Additionally, the +4 for the US Zip code won’t be needed since it does not provide any additional value for geo-position (in this case) and is trimmed to five characters.

Output: 1600 PENNSYLVANIA AVE NW WASHINGTON DC 20500

The more standardized the data values, the more consistent the results. Advanced transformations like LibPostal generate standard address outputs from dirty or erroneous input data. This example is relatively simple, but an address in New York City (20 W 34th St. #123, New York, NY 10001) may cause some parsing inconsistencies. Addresses in other countries like Japan (150-2345 Tokyo-to, Shibuya-ku, Hommachi 2 choume, 4-7, Sunny Mansion 203) might prove too complicated for basic processing scripts not designed for this format.

The results returned from a geo-position transform don’t change the address but are additive to the overall entity details; new attributes are created as placeholders to put this derived data, as shown below. Often, these transformations are designed (containerized) for a particular purpose and are embeddable into the data fusion or decision intelligence system.

Lat = 38.89767

Long = -77.03655

MGRS = 18SUJ23400740

Transformations improve data quality, provide additional dimensions to evaluate, and help standardize the content into consistent and reliable values. All these changes are designed to improve the results generated by an analytical or decision intelligence system. However, a critical piece is still required to tie it all together by merging the different entities created. Our upcoming blog in this series will focus on best practices for performing entity resolution.

Visit NEXYTE.AI to learn more about NEXYTE, the data fusion and machine learning platform revolutionizing decision intelligence.

The post Adding Value to Data: The Power of Data Transformation for Improved Analytical Outcomes appeared first on Cognyte.

Article Link: Data Transformation & ETL | Cognyte