Transcription

DW Loading and RefreshingTechniques: ETLRobert WrembelPoznan University of TechnologyInstitute of Computing SciencePoznań, znan.pl/rwrembelETL in DWS architectureDATA SOURCESINTEGRATION LAYERETLOPERATIONALDATA STOREANALYTICSDATAWAREHOUSE R.Wrembel - Poznan University of Technology, Institute of Computing Science21

Developing ETL Designing and developing ETL processes critical for DW functioning challenges data quality data freshness performance of ETL execution (time window for a DWrefreshing) source evolution ETL optimization costly (time & money) up to 70% project resources– staff– hardware– software R.Wrembel - Poznan University of Technology, Institute of Computing Science3Developing ETL Gartner Report on DW projects in financial institutionsfrom the Fortune 500 list 100 of staff in a DW project55 ETL17 system admins (DB, hardware)4 system architects9 BI consultants5 programmers9 managershardware (multiproc. servers, TB disks, 5mln USD)ETL software (1mln USD) # data sources: 10 to 50 R.Wrembel - Poznan University of Technology, Institute of Computing Science42

Developing ETL # data sources to integrate large banks: hundreds Types of data sources to integrate databases (all possible)text filesspreadsheetsstreaming data (more and more frequently) R.Wrembel - Poznan University of Technology, Institute of Computing Science5Developing ETL Data Science Report. 2016, CrowdFlower R.Wrembel - Poznan University of Technology, Institute of Computing Science63

Developing ETL Panoply Data Warehouse Trends Report 2018 R.Wrembel - Poznan University of Technology, Institute of Computing Science7ETL architectureDATA REHOUSEStaging Area / Operational Data Store R.Wrembel - Poznan University of Technology, Institute of Computing Science8/724

Example ETL process Talend Open Studiosorterfile readercolumn projectionand filteringresult inspection(for testing) R.Wrembel - Poznan University of Technology, Institute of Computing Sciencefile writer9DW designing Analysis of available data sourcesDeciding on DS access technologies (see Topic 1)Data profilingData ingest full incremental TransformingCleaning and homogenizingMergingDuplicate eliminationUploading into a DW R.Wrembel - Poznan University of Technology, Institute of Computing Science105

Data sources Identify relevant DSs DS description business area (e.g., HR, payroll, sales, loans, marketing, .)importancebusiness userbusiness ownertechnical/infractructure ownerhardware OSDBMSschema# transacions/daydata volume increase/dayDB size R.Wrembel - Poznan University of Technology, Institute of Computing Science11Typical predefined connectors IBM DB2SQL ServerOracleSybase ASE, IQNetezzaVerticaTeradataSASSAP HanaGreenplum PostgreSQLMySQLSQLiteFireBirdODBC data sourceJDBC data sourceExcellAccessText, XML, JSON filesHiveImpalaMongoDBCassandra. R.Wrembel - Poznan University of Technology, Institute of Computing Science126

Data profiling Analyzing data sources Main categories of tasks structure discovery (schema, relationships) content analysis (data values, data quality, daily sizeincrease) relationship discovery Application areas ETL for DW data conversion and migration data quality analysis in production DSs Tools statistics data mining R.Wrembel - Poznan University of Technology, Institute of Computing Science13Data profiling Data types and allowed lengths Discovering schema UNIQUE attributesPK candidatesFK candidatesfunctional dependenciesembedded value dependencies (if a denormalized schema) Statistics on data min, max, count, avg, distinct, variance, stdev Computing data distributions (histograms) Assessing costs of potential joins R.Wrembel - Poznan University of Technology, Institute of Computing Science147

Data profiling Discovering data quality identify NULL/NOT NULL columnscount #rows with NULL or default value for each attributeidentify valid allowed values for attributesidentify domains of attributescount #records with values other than expecteddiscover value formats (date, phone No, address, .)discover outliersdiscover wrong values% of: missing values, typos, non-standardized values R.Wrembel - Poznan University of Technology, Institute of Computing Science15Data profiling tools Open source Quadient DataCleanerAggregate ProfilerTalend Open Studio for Data QualityMelissa Data Profiler Commercial IBM InfoSphere Information AnalyzerSAP Business Objects Data Services for Data ProfilingInformatica Data Profiling Solution – Data ExplorerOracle Enterprise Data QualitySAS DataFlux R.Wrembel - Poznan University of Technology, Institute of Computing Science168

Loading data into DW Reading the whole data source text/binary dump files DB exportXML filesSQL select gateway / ODBCsnapshots Reading changes need to detect data changes R.Wrembel - Poznan University of Technology, Institute of Computing Science17Detecting data changes Requirements minimum or none source system changes minimum interference with a data source Solutions audit columns snapshot comparison system maintained log of changes on a table (e.g.,snapshot log) snapshots triggers synchronous transfer analysis of a redo log (transaction log) periodically (log scraping) on-line - continuously (log sniffing) R.Wrembel - Poznan University of Technology, Institute of Computing Science189

Snapshot/replica Copy of a table or a subset of its columns and rows Refreshing automatic with a defined interval on demand SQL Server IBM DB2 Oracle R.Wrembel - Poznan University of Technology, Institute of Computing Science19Data transformation Transform to a common data model h. Transform semantically identical data to a common (thesame) representation extract text values (e.g., parts of an address) Remove unnecessary columns R.Wrembel - Poznan University of Technology, Institute of Computing Science2010

Data transformation Merge semantically identical records do the recordsrepresent the sameentity? R.Wrembel - Poznan University of Technology, Institute of Computing Science21Data transformation Requirements iterative and interactive process define transformationrun processverify resutlsmodify transformation (if needed)run process. extendible and easy to modify as much data as possible should be transformedautomatically as much steps as possible should be automatic R.Wrembel - Poznan University of Technology, Institute of Computing Science2211

Data cleaning Remove/replace null values Correct typos dictionaries (spelling, names, cities, countries) Correct semantical errors gross net vat address consistent with ZIP code Standardize values date formatcurrencycapital/small lettersabbreviationssynonyms (Word Net) R.Wrembel - Poznan University of Technology, Institute of Computing Science23Data deduplication Records must be cleaned no special signs, no punctuations no abbreviations Problem: how to decide if 2 records represent the sameentity? {Wrembel, Robert, ul. Wyspiańskiego, Poznań} {Wrębel, Robert, ul. Wyspiańskiego, Poznań} Case 1: natural identifiers (e.g., ID, email, mobilenumber) available Case 2: no natural identifiers available approximate/probabilistic decision based on a similaritymeasure R.Wrembel - Poznan University of Technology, Institute of Computing Science2412

Data deduplication Simple similarity measure # matching atomic tokens (text strings) / # total atomictokens in compared records # total atomic tokens: 8 # matching atomic tokens: 3 similarity 3/8Université Lumière Lyon 2 , Laboratorie EricUniv. Lyon 2 , ERIC R.Wrembel - Poznan University of Technology, Institute of Computing Science25Data deduplication Soundex returns a code of pronunciation of an input soundex('Smith') soundex('Smit') S530 Levenhstein/edit distance minimum number of inserts and deletes (updates) ofcharacters in order to convert L1 to L2 L1 and L2 identical didtance 0 ABC ABCDEF: distance 3 DEFCAB ABC: distance 5 R.Wrembel - Poznan University of Technology, Institute of Computing Science2613

DW refreshing When? synchronous (after a source transaction was committed) (near) real-time DW asynchronous traditional DW with a defined frequency on demand How? full (1st DW load) incremental (all next loads) How data arrive? batch traditional DW stream (near) real-time DW R.Wrembel - Poznan University of Technology, Institute of Computing Science27DW refreshing In a constraint time window (typically 8h) SSD throughput: 500MB/s 1TB in approx. 35min Magnetic disc throughput: 100MB/s 1TB in approx. 3h Efficiency is crucial read from DSs only necessary data Do not execute in a DS sorting DISTINCT set operators GROUP BY NOT and non-equijoins (typically require full scan) functions in the WHERE clause R.Wrembel - Poznan University of Technology, Institute of Computing Science2814

DW refreshing Where to filter data? at a data source, if not overloaded with its proper processing powerful query optimizer good use of indexes in an ETL layer, otherwise sorting in a database sorting in an OS (awk) Separate inserts from updates updates standard path inserts direct load path Decide how to maintain additional data structures indexes materialized views Integrity constraints in a DW? R.Wrembel - Poznan University of Technology, Institute of Computing Science29DW refreshing Parallel loading Collecting DB statistics after refreshing DB defragmentation R.Wrembel - Poznan University of Technology, Institute of Computing Science3015

Summary: ETL design processData profilingDefine and deploy ETL processesTest on a sample and verify a resultrepositoryRun production ETLModify data sources(to improve data quality) Jarke M., et. al.: Improving OLTP Data Quality Using Data Warehouse Mechanisms.SIGMOD Record, (28):2, 1999 R.Wrembel - Poznan University of Technology, Institute of Computing Science31Implementing ETL ETL workflow graph of tasks connected by flows Typical tasks aggregation (max, sum, .)filteringjoinlook-upsequence generationsortingsplitting flowsmerging flowsUDF. R.Wrembel - Poznan University of Technology, Institute of Computing Science3216

Metadata On data sources On ETL processes On data warehouse On data sources location (IP address)hardware operating systemtype (RBD, OBD, XML, spreadsheet, .)schemaaccess methods (SQL, XQuery, dump file, .)connection credentialsresults of data profilingvolumeperformance characteristics R.Wrembel - Poznan University of Technology, Institute of Computing Science33Metadata On ETL data storage architecture of ODS and DW (e.g., diskcapacities, row-store / column-store) metadata on a dataset to upload to DW (e.g., size, avg.record lengths) definitions of ETL tasks/steps available dictionaries (e.g., cities, zip codes, names) workflow execution schedules execution statistics (e.g., elapsed time, CPU time, #I/O,RAM usage, throughput, disc access conflicts, #recordsuploaded, #records rejected) dependencies between tasks for impact analysis mappings between DS and DW structures data lineage execution logs R.Wrembel - Poznan University of Technology, Institute of Computing Science3417

Requirements for ETL Efficiency finishing in a predefined time window estimating execution termination Optimizable Fault-tolerance restart after removing errors from a break point restart from the beginning recovery after crash Manageability scheduling executions time-based token-based stopping and restarting tasks impact analysis easy modifiable workflows R.Wrembel - Poznan University of Technology, Institute of Computing Science35Requirements for ETLProducing high quality dataSecurity: access controlA palette of predefined stepsAutomatic code generationSupport of UDFsAutomatic reporting on termination, errors, exceptions,and progress Parallel processing Direct path loading Semi-automatic adjustment to DS changes R.Wrembel - Poznan University of Technology, Institute of Computing Science3618

Off-the-shelf vs. in-house Off-the-shelf faster design anddeployment integrated data repository metadata management workflow executionscheduling built-in drivers to multipleDSs impact analysis incremental data loading parallel processing price often require moreadvanced architectures cost In-house-developed longer design anddevelopment thorrough testing dedicated to a givenscenario not customizable/flexible may be tuned to a givenscenario may be less expensive R.Wrembel - Poznan University of Technology, Institute of Computing Science37Off-the-shelf technology? Financial Times (18 Sep, 2013): Why big IT projects crash www.ft.com ". the Texas state auditor’s office examined 13 IT projects,nine of which had overrun. It concluded, admittedly on a smallsample, that agencies using commercial off-the-shelftechnology exceeded their budgets by a smaller amount andtook less time to complete their projects than those that didnot" R.Wrembel - Poznan University of Technology, Institute of Computing Science3819

Gartner Report Commercial IBM Data StageInformaticaMicrosoft Integration ServicesABInitio Open-source Talend Open StudioPentaho Data IntegrationCloverETLApache NiFi R.Wrembel - Poznan University of Technology, Institute of Computing Science3920