Towards Efficient Dataflow Frameworks for Big Data ...
Research in Digital Science Center
Geoffrey Fox, April 18, 2019
Digital Science Center, Department of Intelligent Systems Engineering
[email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/
Judy Qiu, David Crandall, Gregor von Laszewski, Dennis Gannon
Supun Kamburugamuve, Bo Peng, Langshi Chen, Kannan Govindarajan, Fugang Wang
nanoBIO Collaboration with several SICE faculty
CyberTraining Collaboration with several SICE faculty
Internal collaboration: Biology, Physics, SICE
Outside collaborators in funded projects: Arizona, Kansas, Purdue, Rutgers, San Diego Supercomputer Center, SUNY Stony Brook, Virginia, UIUC and Utah
NIST and Fudan University in unfunded collaborations

Digital Science Center/ISE Infrastructure
Run computer infrastructure for Cloud and HPC research
Romeo: 16 K80 and 16 Volta GPUs, 8 Haswell nodes; used in Deep Learning Course E533 and research (Volta have NVLink)
Victor/Tempest: 26 nodes, Infiniband/Omnipath, Intel Xeon Platinum 48-core nodes
Tango: 64-node system with high performance disks (SSD, NVRam = 5x SSD and 25x HDD) and Intel KNL (Knights Landing) manycore (68-72) chips; Omnipath interconnect
Juliet: 128-node system with two 12-18 core Haswell chips, SSD and conventional HDD disks; Infiniband interconnect
FutureSystems Bravo, Delta, Echo: old but useful; 48 nodes
All have HPC networks and all can run HDFS and store data on nodes
Supported by Gary Miksik, Allan Streib, Josh Ballard

Digital Science Center Research Activities
Building SPIDAL Scalable HPC Machine Learning Library
Applying current SPIDAL in Biology, Network Science (OSoMe), Pathology, Racing Cars
Harp HPC Machine Learning Framework (Qiu)
Twister2 HPC Event Driven Distributed Programming model (replace Spark)
Cloudmesh: Research and DevOps for Software Defined Systems (von Laszewski)
Intel Parallel Computing Center @IU (Qiu)
Edge Computing
Work with NIST on Big Data Standards and non-proprietary Frameworks
Engineered nanoBIO Node NSF EEC-1720625 with Purdue and UIUC
Polar (Radar) Image Processing (Crandall); being used in production
Data analysis of experimental physics scattering results
IoTCloud: cloud control of robots, licensed to C2RO (Montreal)

Big Data on HPC Cloud
http://www.iterativemapreduce.org/

Overall Global AI and Modeling Supercomputer GAIMSC
Digital Science Center, 2/13/2019
From Microsoft (figure)
From Microsoft (figures): https://www.microsoft.com/en-us/research/event/faculty-summit-2018/

Overall Global AI and Modeling Supercomputer GAIMSC Architecture
There is only a cloud at the logical center, but it is physically distributed and owned by a few major players
There is a very distributed set of devices surrounded by local Fog computing; this forms the logically and physically distributed edge
The edge is structured and largely data
These are two differences from the Grid of the past
e.g. a self-driving car will have its own fog and will not share fog with the truck that it is about to collide with
The cloud and edge will both be very heterogeneous with varying accelerators, memory size and disk structure
What is the software model for GAIMSC?

Collaborating on the Global AI and Modeling Supercomputer GAIMSC
Microsoft says:
We can only play together and link functionalities from Google, Amazon, Facebook, Microsoft, Academia if we have open APIs and open code to customize
We must collaborate
Open source Apache software
Academia needs to use and define their own Apache projects
We want to use AI and modeling supercomputer for AI-Driven engineering and science studying the early universe and the Higgs boson and not just producing annoying advertisements (goal of most elite CS researchers)
Systems Challenges for GAIMSC
Architecture of the Global AI and Modeling Supercomputer GAIMSC must reflect:
Global captures the need to mashup services from many different sources;
AI captures the incredible progress in machine learning (ML);
Modeling captures both traditional large-scale simulations and the models and digital twins needed for data interpretation;
Supercomputer captures that everything is huge and needs to be done quickly and often in real time for streaming applications.
The GAIMSC includes an intelligent HPC cloud linked via an intelligent HPC Fog to an intelligent HPC edge. We consider this distributed environment as a set of computational and data-intensive nuggets swimming in an intelligent aether.
We will use a dataflow graph to define a structure in the aether
GAIMSC requires parallel computing to achieve high performance on large ML and simulation nuggets and distributed system technology to build the aether and support the distributed but connected nuggets.
In the latter respect, the intelligent aether mimics a grid but it is a data grid where there are computations but typically those associated with data (often from edge devices).
So unlike the distributed simulation supercomputer that was often studied in previous grids, GAIMSC is a supercomputer aimed at very different data intensive AI-enriched problems.
Integration of Data and Model functions with ML wrappers in GAIMSC
ML is increasingly integrated with simulations. ML can analyze results, guide the execution and set up initial configurations (autotuning). This is equally true for AI itself -- the GAIMSC will use itself to optimize its execution for both analytics and simulations. See The Case for Learned Index Structures from MIT and Google
In principle every transfer of control (job or function invocation, a link from device to the fog/cloud) should pass through an AI wrapper that learns from each call and can decide both whether the call needs to be executed (maybe we have learned the answer already and need not compute it) and how to optimize the call if it really needs to be executed.
The digital continuum (proposed by BDEC2) is an intelligent aether learning from and informing the interconnected computational actions that are embedded in the aether.
Implementing the intelligent aether embracing and extending the edge, fog, and cloud is a major research challenge where bold new ideas are needed!
We need to understand how to make it easy to automatically wrap every nugget with ML.
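The wrapper idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GAIMSC design: the class name `LearningWrapper`, the function `slow_simulation`, and the exact-match lookup table are all hypothetical, and the table merely stands in for a trained ML surrogate that would generalize to unseen inputs.

```python
from typing import Callable, Dict, Hashable, Optional


class LearningWrapper:
    """Hypothetical wrapper that learns from each call and reuses known answers."""

    def __init__(self, nugget: Callable[[Hashable], float]) -> None:
        self.nugget = nugget                      # the wrapped computation
        self.learned: Dict[Hashable, float] = {}  # answers learned so far
        self.executed = 0                         # calls that really ran
        self.skipped = 0                          # calls answered without running

    def lookup(self, request: Hashable) -> Optional[float]:
        # Stand-in for ML inference: an exact-match table (assumption).
        # A real wrapper would query a trained surrogate model here.
        return self.learned.get(request)

    def __call__(self, request: Hashable) -> float:
        answer = self.lookup(request)
        if answer is not None:
            # The answer is already learned, so the call is not executed.
            self.skipped += 1
            return answer
        # The call really needs to run; record the result for later reuse.
        answer = self.nugget(request)
        self.executed += 1
        self.learned[request] = answer
        return answer


def slow_simulation(x: int) -> float:
    """Hypothetical expensive nugget."""
    return float(x * x)


wrapped = LearningWrapper(slow_simulation)
results = [wrapped(x) for x in (3, 4, 3, 3, 4)]
```

In this sketch only the first call for each distinct input executes the nugget; the counters `executed` and `skipped` make the saving visible.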
Implementing the GAIMSC
My recent research aims to make good use of high-performance technologies and yet preserve the key features of the Apache Big Data Software.
Originally aimed at using HPC to run Machine Learning, but this is now reasonably well understood and the new focus is integration of ML, clouds and edge
We will describe Twister2, which seems well suited to build the prototype intelligent high-performance aether.
Note this will mix many relatively small nuggets with AI wrappers, generating parallelism from the number of nuggets and not internally to the nugget and its wrapper.
However, there will also be large global jobs requiring internal parallelism for individual large-scale machine learning or simulation tasks.
Thus parallel computing and distributed systems (grids) must be linked in a clean fashion, and the key parallel computing ideas needed for ML are closely related to those already developed for simulations.
Application Requirements

Distinctive Features of Applications
Ratio of data to model sizes: vertical axis on next slide
Importance of Synchronization: ratio of inter-node communication to node computing: horizontal axis on next slide
Sparsity of Data or Model; impacts value of GPUs or vector computing
Irregularity of Data or Model
Geographic distribution of Data as in edge computing; use of streaming (dynamic data) versus batch paradigms
Dynamic model structure as in some iterative algorithms
Big Data and Simulation Difficulty in Parallelism
Figure axes: Size of Synchronization constraints (Loosely Coupled to Tightly Coupled); Size of Disk I/O
Figure labels: Pleasingly Parallel (often independent events); MapReduce as in scalable databases (current major Big Data category); Global Machine Learning e.g. parallel clustering; Deep Learning; LDA; Graph Analytics e.g. subgraph mining; Linear Algebra at core (often not sparse); Unstructured Adaptive Sparse; Structured Adaptive Sparse; Parameter sweep simulations; Largest scale simulations
Platform labels: Commodity Clouds; HPC Clouds: Accelerators, High Performance Interconnect; HPC Clouds/Supercomputers (memory access also critical); Exascale Supercomputers
Just two problem characteristics
There is also data/compute distribution seen in grid/edge computing

Comparing Spark, Flink and MPI

Machine Learning with MPI, Spark and Flink
Three algorithms implemented in three runtimes
Multidimensional Scaling (MDS)
Terasort
K-Means (drop as no time and looked at later)
Implementation in Java
MDS is the most complex algorithm - three nested parallel loops
K-Means - one parallel loop
Terasort - no iterations (see later)
With care, Java performance ~ C performance
Without care, Java performance << C performance (details omitted)

Multidimensional Scaling: 3 Nested Parallel Sections
Figure series: Flink, Spark, MPI
MPI Factor of 20-200 Faster than Spark/Flink
Left: MDS execution time on 16 nodes with 20 processes in each node with varying number of points
Right: MDS execution time with 32000 points on varying number of nodes. Each node runs 20 parallel tasks
Spark, Flink No Speedup
Flink especially loses touch with relationship of computing and data location
In open Wound Pragmas, Twister2 uses Parallel First Touch and Owner Computes
Current Big Data systems use forgotten touch, owner forgets and Tragedy of the Commons Computes

Linking Machine Learning and HPC
MLforHPC and HPCforML
We distinguish between different interfaces for ML/DL and HPC.
HPCforML: Using HPC to execute and enhance ML performance, or using HPC simulations to train ML algorithms (theory guided machine learning), which are then used to understand experimental data or simulations.
  HPCrunsML: Using HPC to execute ML with high performance
  SimulationTrainedML: Using HPC simulations to train ML algorithms, which are then used to understand experimental data or simulations.
MLforHPC: Using ML to enhance HPC applications and systems
  MLautotuning: Using ML to configure (autotune) ML or HPC simulations.
  MLafterHPC: ML analyzing results of HPC, e.g., trajectory analysis in biomolecular simulations
  MLaroundHPC: Using ML to learn from simulations and produce learned surrogates for the simulations. The same ML wrapper can also learn configurations as well as results
  MLControl: Using simulations (with HPC) in control of experiments and in objective driven computational campaigns, where simulation surrogates allow real-time predictions.
MLAutotuned HPC. Machine Learning for Parameter Auto-tuning in Molecular Dynamics Simulations: Efficient Dynamics of Ions near Polarizable Nanoparticles
JCS Kadupitiya, Geoffrey Fox, Vikram Jadhao
Integration of machine learning (ML) methods for parameter prediction for MD simulations by demonstrating how they were realized in MD simulations of ions near polarizable NPs.
Note ML used at start and end of simulation blocks
Figure: ML-Based Simulation Configuration (Training, Testing, Inference I, Inference II)

MLaroundHPC: Machine learning for performance enhancement with Surrogates of molecular dynamics simulations
We find that an artificial neural network based regression model successfully learns desired features associated with the output ionic density profiles (the contact, mid-point and peak densities), generating predictions for these quantities that are in excellent agreement with the results from explicit molecular dynamics simulations.
The integration of an ML layer enables real-time and anytime engagement with the simulation framework, thus enhancing the applicability for both research and educational use.
Deployed on nanoHUB for education
ML used during simulation
Figure: ML-Based Simulation Prediction, ANN Model (Training, Inference)

Speedup of MLaroundHPC
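One expression for the effective speedup S that reproduces both limiting cases stated on this slide is sketched here. It is an assumption reconstructed from the symbol definitions and limits given on the slide, not a formula quoted from it.

```latex
% Assumed form: total sequential work for all results, divided by the
% cost of N_train trained simulations plus N_lookup surrogate lookups.
\[
S \;=\; \frac{T_{\mathrm{seq}}\,\bigl(N_{\mathrm{lookup}} + N_{\mathrm{train}}\bigr)}
             {T_{\mathrm{lookup}}\,N_{\mathrm{lookup}}
              + \bigl(T_{\mathrm{train}} + T_{\mathrm{learn}}\bigr)\,N_{\mathrm{train}}}
\]
% No ML (N_lookup = 0 and T_learn = 0):   S = T_seq / T_train
% Inference dominates (N_lookup >> N_train):  S -> T_seq / T_lookup
```

With no lookups and no learning cost the expression reduces to the classic parallel speedup of the simulation itself, and when looked-up results dominate it approaches the ratio of sequential simulation time to inference time.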
Tseq is sequential time
Ttrain is time for a (parallel) simulation used in training ML
Tlearn is time per point to run machine learning
Tlookup is time to run inference per instance
Ntrain is number of training samples (7K to 16K in our work)
Nlookup is number of results looked up
Speedup becomes Tseq/Ttrain if ML not used
Speedup becomes Tseq/Tlookup (10^5 faster in our case) if inference dominates (will overcome end of Moore's law and win the race to zettascale)
This application deployed on nanoHUB for high performance education

Programming Environment for Global AI and Modeling Supercomputer GAIMSC
HPCforML and MLforHPC
Ways of adding High Performance to Global AI (and Modeling) Supercomputer
Fix performance issues in Spark, Heron, Hadoop, Flink etc.
  Messy as some features of these big data systems are intrinsically slow in some (not all) cases
  All these systems are monolithic and it is difficult to deal with individual components
Execute HPBDC from classic big data system with custom communication environment: approach of Harp for the relatively simple Hadoop environment
Provide a native Mesos/Yarn/Kubernetes/HDFS high performance execution environment with all capabilities of Spark, Hadoop and Heron: goal of Twister2
Execute with MPI in classic (Slurm, Lustre) HPC environment
Add modules to existing frameworks like Scikit-Learn or Tensorflow either as new capability or as a higher performance version of existing module.

Integrating HPC and Apache Programming Environments
Harp-DAAL with a kernel Machine Learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
Harp-DAAL supports all 5 classes of data-intensive AI first computation, from pleasingly parallel to machine learning and simulations.
Twister2 is a toolkit of components that can be packaged in different ways
Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance.
Separate bulk synchronous and data flow communication
Task management as in Mesos, Yarn and Kubernetes
Dataflow graph execution models
Launching of the Harp-DAAL library with native Mesos/Kubernetes/HDFS environment
Streaming and repository data access interfaces
In-memory databases and fault tolerance at dataflow nodes (use RDD (Tsets) to do classic checkpoint-restart)

Twister2 Highlights I
Big Data Programming Environment such as Hadoop, Spark, Flink, Storm, Heron but with significant differences (improvements)
Uses HPC wherever appropriate
Links to Apache Software (Kafka, Hbase, Beam) wherever appropriate
Runs preferably under Kubernetes and Mesos but Slurm supported
Highlight is high performance dataflow supporting iteration, fine-grain, coarse grain, dynamic, synchronized, asynchronous, batch and streaming
Two distinct communication environments
  DFW Dataflow with distinct source and target tasks; data not message level
  BSP for parallel programming; MPI is default
Rich state model for objects supporting in-place, distributed, cached, RDD style persistence

Twister2 Highlights II
Can be a pure batch engine
  Not built on top of a streaming engine
Can be a pure streaming engine supporting Storm/Heron API
  Not built on top of a batch engine
Fault tolerance as in Spark or MPI today; dataflow nodes define natural synchronization points
Many APIs: Data (at many levels), Communication, Task
  High level (as in Spark) and low level (as in MPI)
Component based architecture -- it is a toolkit
  Defines the important layers of a distributed processing engine
  Implements these layers cleanly aiming at data analytics and with high performance

Twister2 Highlights III
Key features of Twister2 are associated with its dataflow model
Fast and functional inter-node linkage; distributed from edge to cloud or in-place between identical source and target tasks
Streaming or Batch nodes (Storm persistent or Spark ephemeral model)
Supports both Orchestration (as in Pegasus, Kepler, NiFi) and high performance streaming flow (as in Naiad) models
Tset Twister2 datasets like RDD define a full object state model supported across links of dataflow

Twister2 Logistics
Open Source - Apache License Version 2.0
Github - https://github.com/DSC-SPIDAL/twister2
Documentation - https://twister2.gitbook.io/twister2 with tutorial
Developer Group - [email protected] India(1) Sri Lanka(9) and Turkey(2)
Started in the 4th Quarter of 2017; reversing previous philosophy which was to modify Hadoop, Spark, Heron
Bootstrapped using Heron code but that code now changed
About 80000 Lines of Code (plus 50,000 for SPIDAL+Harp HPCforML)
Languages - Primarily Java with some Python

Twister2 Team

Big Data APIs
Started with Map-Reduce
Different Data APIs in community
Task Graph with computations on data in nodes
Data transformation APIs
Apache Crunch PCollections
Apache Spark RDD
Apache Flink DataSet
Apache Beam PCollections
Apache Heron Streamlets
Apache Storm Task Graph
SQL based APIs
High-level Data API hides communication and decomposition from the user
Lower-level messaging and Task APIs offer harder to use, more powerful capabilities

GAIMSC Programming Environment Components I
Area | Component | Implementation | Comments: User API
Architecture Specification | Coordination Points | State and Configuration Management; Program, Data and Message Level | Change execution mode; save and reset state
Architecture Specification | Execution Semantics | Mapping of Resources to Bolts/Maps in Containers, Processes, Threads | Different systems make different choices - why?
Architecture Specification | Parallel Computing | Spark Flink Hadoop Pregel MPI modes | Owner Computes Rule
Job Submission | (Dynamic/Static) Resource Allocation | Plugins for Slurm, Yarn, Mesos, Marathon, Aurora | Client API (e.g. Python) for Job Management
Task System | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | Task-based programming with Dynamic or Static Graph API; FaaS API; Support accelerators (CUDA, FPGA, KNL)
Task System | Elasticity | OpenWhisk |
Task System | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ |
Task System | Task Execution | Process, Threads, Queues |
Task System | Task Scheduling | Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms |
Task System | Task Graph | Static Graph, Dynamic Graph Generation |

GAIMSC Programming Environment Components II
Area | Component | Implementation | Comments
Communication API | Messages | Heron | This is user level and could map to multiple communication systems
Communication API | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse grain Dataflow from NiFi, Kepler? | Streaming, ETL data pipelines; Define new Dataflow communication API and library
Communication API | BSP Communication Map-Collective | Conventional MPI, Harp | MPI Point to Point and Collective API
Data Access | Static (Batch) Data | File Systems, NoSQL, SQL | Data API
Data Access | Streaming Data | Message Brokers, Spouts |
Data Management | Distributed Data Set | Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data | Data Transformation API; Spark RDD, Heron Streamlet
Fault Tolerance | Check Pointing | Upstream (streaming) backup; Lightweight; Coordination Points; Spark/Flink, MPI and Heron models | Streaming and batch cases distinct; Crosses all components
Security | Storage, Messaging | Research needed | Crosses all Components

Execution as a Graph for Data Analytics
The graph created by the user API can be executed using an event model
The events flow through the edges of the graph as messages
The compute units are executed upon arrival of events
Supports Function as a Service
Execution state can be checkpointed automatically with natural synchronization at node boundaries
Fault tolerance
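The event model described above can be illustrated with a small single-process Python sketch. It is not Twister2 code: the class `DataflowGraph` and its methods are hypothetical names, and the sketch assumes a simple FIFO queue of events, with each task running when a message arrives on one of its incoming edges.

```python
from collections import defaultdict, deque
from typing import Any, Callable, Deque, Dict, List, Tuple


class DataflowGraph:
    """Hypothetical event-driven task graph: nodes compute, edges carry messages."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Callable[[Any], Any]] = {}
        self.edges: Dict[str, List[str]] = defaultdict(list)
        self.trace: List[Tuple[str, Any]] = []  # (task, output) in execution order

    def add_task(self, name: str, fn: Callable[[Any], Any]) -> None:
        self.tasks[name] = fn

    def connect(self, source: str, target: str) -> None:
        self.edges[source].append(target)

    def run(self, source: str, events: List[Any]) -> Dict[str, List[Any]]:
        """Inject events at `source` and return the outputs of sink tasks."""
        queue: Deque[Tuple[str, Any]] = deque((source, e) for e in events)
        sinks: Dict[str, List[Any]] = defaultdict(list)
        while queue:
            # A compute unit executes only when an event arrives for it.
            task, message = queue.popleft()
            output = self.tasks[task](message)
            self.trace.append((task, output))
            targets = self.edges.get(task, [])
            if not targets:
                sinks[task].append(output)
            for target in targets:
                # The output flows along each outgoing edge as a new message.
                queue.append((target, output))
        return dict(sinks)


graph = DataflowGraph()
graph.add_task("source", lambda x: x)
graph.add_task("square", lambda x: x * x)
graph.add_task("plus_one", lambda x: x + 1)
graph.connect("source", "square")
graph.connect("square", "plus_one")
outputs = graph.run("source", [1, 2, 3])
```

A finite list of injected events plays the role of a batch job here, while an unbounded source would correspond to streaming. The task boundaries where messages are handed over are the natural places at which state could be checkpointed.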
Figure labels: Task Schedule; Graph; Events flow through edges; Execution Graph (Plan)

HPC APIs
Dominated by Message Passing Interface (MPI)
Provides the most fundamental requirements in the most efficient ways possible
  Communication between parallel workers
  Managing of parallel processes
HPC has task systems and Data APIs
  They are all built on top of parallel communication libraries
  Legion from Stanford on top of CUDA and active messages (GASNet)
Actually HPC usually defines model parameter APIs and Big Data defines Data APIs
One needs both data and model parameters treated similarly in many cases

Simple MPI Program

Twister2 Features by Level of Effort
Feature | Current Lines of Code | Near-term Addons
Mesos+Kubernetes+Slurm Integration + Resource and Job Management (Job master) | 15000 |
Task system (scheduler + dataflow graph + executor) | 10000 |
DataFlow operators, Twister:Net | 20000 |
Fault tolerance | 2000 | 3000
Python API | | 5000
Tset and Object State | 2500 |
Apache Storm Compatibility | 2000 |
Apache Beam Connection | | 2000-5000
Connected (external) Dataflow | 5000 | 4000-8000
Data Access API | 5000 | 5000
Connectors (RabbitMQ, MQTT, SQL, HBase etc.) | 1000 (Kafka) | 10000
Utilities and common code | 9000 (5000 + 4000) |
Application Test code | 10000 |
Dashboard | 4000 |

Twister2 Implementation by Language
Table columns: Language, Files, Blank Lines, Comment Lines, Lines of Code
Software Engineering will double amount of code with unit tests etc.

Runtime Components
Figure labels, user level: Streaming, Batch and ML Applications; User APIs (Java APIs, Scala APIs, Python API, SQL API); Orchestration API; Atomic Job Submission; Connected or External DataFlow
Figure labels, runtime level: TSet Runtime; Task Graph System; State; BSP Operations; Internal (fine grain) DataFlow and State Definition Operations
Figure labels, resource and data level: Resource API (Mesos, Kubernetes, Standalone, Slurm, Local); Data Access APIs (HDFS, NoSQL, Message Brokers)
Future Features: Python API critical

Twister2 APIs in Detail
APIs built on top of Task Graph
APIs are built combining different components of the System
Higher Level APIs based on Task Graph
Worker (Java API): Low level APIs with the most flexibility. Harder to program
Operator API (Java API): User in full control, harder to program
Task API (Java API): Abstracts the threads, messages. Intermediate API
TSet API (Java API): Easy to program functional API with type support
Python API: Future

Twister2 API Levels
Figure axes: Performance/Flexibility; Ease of Use
Suitable for Simple Applications, Ex - Pleasingly Parallel
Suitable for Complex Applications, Ex - Graph Analytics

Features of Twister2: HPCforML (Harp, SPIDAL)
DFW Communication Twister:Net
Dataflow in Twister2
Naive Bayes: Reduce; DAAL
Linear Regression: Reduce; DAAL
Ridge Regression: Reduce; DAAL
Multi-class Logistic Regression: Regroup, Rotate, AllGather
Random Forest: AllReduce
Principal Component Analysis (PCA): AllReduce; DAAL
DAAL implies integrated on node with Intel DAAL Optimized Data Analytics Library

Run time software for Harp
Collectives: broadcast, regroup, reduce, allreduce, allgather, push & pull, rotate
Map Collective Run time merges MapReduce and HPC

Harp v. Spark
Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
10 to 20 nodes of Intel KNL7250 processors
Harp-DAAL has 15x speedups over Spark MLlib

Harp v. Torch
Datasets: 500K or 1 million data points of feature dimension 300
Running on single KNL 7250 (Harp-DAAL) vs. single K80 GPU (PyTorch)
Harp-DAAL achieves 3x to 6x speedups

Harp v. MPI
Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
25 nodes of Intel Xeon E5 2670
Harp-DAAL has 2x to 5x speedups over state-of-the-art MPI-Fascia solution

Twister2 Dataflow Communications
Twister:Net offers two communication models
BSP (Bulk Synchronous Processing): message-level communication using TCP or MPI, separated from its task management, plus extra Harp collectives
DFW: a new Dataflow library built using MPI software but at data movement not message level
  Non-blocking
  Dynamic data sizes
  Streaming model
    Batch case is represented as a finite stream
  The communications are between a set of tasks in an arbitrary task graph
  Key based communications
  Data-level Communications spilling to disks
  Target tasks can be different from source tasks
Figure: BSP and DFW for Reduce Operation

Twister:Net and Apache Heron and Spark
Left: K-means job execution time on 16 nodes with varying centers, 2 million points with 320-way parallelism.
Right: K-Means with 4, 8 and 16 nodes where each node has 20 tasks. 2 million points with 16000 centers used.
Latency of Apache Heron and Twister:Net DFW (Dataflow) for Reduce, Broadcast and Partition operations in 16 nodes with 256-way parallelism

Results
Twister2 performance against Apache Flink and MPI for Terasort.
Notation: DFW refers to Twister2; BSP refers to MPI (OpenMPI)

Summary of Research in Digital Science Center
Interface of HPC/Computation and Machine Learning
Designing, building and using the Global AI and Modeling Supercomputer
Cloudmesh builds interoperable Cloud systems (von Laszewski)
Harp is parallel high performance machine learning (Qiu)
Twister2 can offer the major Spark Hadoop Heron capabilities with clean high performance
nanoBIO Node builds Bio and Nano simulations (Jadhao, Macklin, Glazier)
Polar Grid builds radar image processing algorithms
Other applications: Pathology, Precision Health, Network Science, Physics, Analysis of simulation visualizations
Try to keep our system infrastructure up to date and optimized for data-intensive problems (fast disks on nodes)