Compute (machine learning)

Compute (machine learning)

In machine learning and deep learning, compute is the amount of computing power or computational resources required to train machine learning models and large language models. More broadly, compute is the computational power or resources necessary for a computer or computer program to function. == Definition == Compute is commonly defined as the amount of computing power or computational resources required to train machine learning and large language models. The term "compute" has also been more broadly applied to cloud computing, referencing processing power, memory, networking, storage, and other resources required for the computation of any program. Compute is measured in petaflop/s-days and is used to document AI training. A petaflop/s-day (pfs-day) consists of performing 1015 neural net operations per second for one day, or a total of about 1020 operations. The compute-time product serves as a mental convenience, similar to kilowatt-hour for energy. An amount of compute is meant to give an idea of the number of actual operations performed. == History == In a 2018 analysis titled "AI and compute", artificial intelligence company OpenAI introduced the concept of compute. OpenAI identified two eras of training AI systems in terms of compute-usage. From 1959 to 2012, compute roughly followed Moore’s law. Between 2012 and 2018, the amount of compute used in the largest AI training runs increased exponentially, growing by more than 300,000 times — roughly doubling every 3.4 months. By comparison, Moore’s Law doubled every two years over the same period. One of the largest models, released in 2020, used 600,000 times more computing power than the 2012 model. After 2020, compute growth began to slow down, with the compute needed for the largest AI models continuing to slow down in 2023. The notion of compute has become increasingly used from the mid-2020s onwards. == Compute growth and AI progress == Larger AI models trained on more data and using more computational resources, tend to perform better. This happens even if the algorithms themselves remain unchanged. As early as 2018, OpenAI noted the exponential increase in compute to be have a key role in AI progress. OpenAI considers three factors drive the advance of AI: algorithmic innovation, data, and the amount of compute available for training. AI models with more compute not only improve in the tasks they were trained on but can develop emergent abilities. Incremental improvements can lead to more abrupt leaps in capabilities. AI provider SpaceXAI said in 2026 that their AI progress is driven by compute and used it a key metric in the AI training of its supercomputer Colossus, the which contains 1 million GPUs. Anthropic has a contract of $1.25 billion per month with SpaceXAI to buy all the compute capacity at Colossus 1 data center. === Criticism and policy === Increasing, promoting or constraining progress in artificial intelligence has often be done via controlling the amount of compute. Policymarkers have enacted policies and provided support to make compute resources more accessible to domestic AI researchers. In a January 2022 report, the Center for Security and Emerging Technology (CSET) suggested to institutions that increasingly powerful and generalizable AI (AGI) will likely require other strategies than maximizing compute. Some AI researchers are also concerned that government might exclusively focus on scaling compute instead of other strategies. The CSET has reported on the various bottlenecks which could explain why deep learning needs for compute have slow down: training is expensive and training extremely large models generates traffic jams across many processors that are difficult to manage. there is a limited supply of AI chips (see AI chip memory shortage). CSET advances that the main resource is human capital, specifically talented researchers — according to a 2023 published survey of more than 400 AI researchers, academic and private sector workers. The survey found that AI researchers are not primarily or exclusively constrained by compute access. However, both academic and industry AI researchers equally report concerns that insufficient compute could prevent them from contributing meaningfully to AI research in the future. High compute users are more concerned about compute access. When asked about which resource provided by the government would be the most useful to them, some AI researchers select compute, other prefer grant funding. For this goal, CSET advised policymakers to ensure that even researchers with smaller budgets could effectively contribute to AI research. Other proposed strategies include using contemporary AI algorithms, managing modern AI infrastructure or focusing on interdisciplinary work between the AI field and other fields of computer science. A 2024 study on compute access found that academic-only AI research teams often have less compute intensive research topics, especially foundation models, compared to industry AI labs. As a consequence, academia is likely to play a smaller role in advancing such techniques. The researchers suggest nationally-sponsored computing infrastructure as well as open science initiatives to boost academic compute access. === Data === A 2022 study found that current large language models are significantly under-trained, a consequence of focusing on scaling language models whilst keeping the amount of training data constant. By training over 400 language models of various parameter and token size, they found that "for compute-optimal training", the model size and the number of training tokens should ideally be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Dabbler

Dabbler is natural media drawing software for beginners. It was initially developed by Fractal Design Corporation. It is a simplified version of Fractal Design Painter, and included multimedia tutorials and a fullscreen interface. Dabbler was released as "Art Dabbler" after the MetaCreations merger, and rights were eventually transferred to Corel. Dabbler operating systems are Mac OS and Microsoft Windows.

Data independence

Data independence is the type of data transparency that matters for a centralized DBMS. It refers to the immunity of user applications to changes made in the definition and organization of data. Application programs should not, ideally, be exposed to details of data representation and storage. The DBMS provides an abstract view of the data that hides such details. There are two types of data independence: physical and logical data independence. The data independence and operation independence together gives the feature of data abstraction. There are two levels of data independence. == Logical data independence == The logical structure of the data is known as the 'schema definition'. In general, if a user application operates on a subset of the attributes of a relation, it should not be affected later when new attributes are added to the same relation. Logical data independence indicates that the conceptual schema can be changed without affecting the existing schemas. == Physical data independence == The physical structure of the data is referred to as "physical data description". Physical data independence deals with hiding the details of the storage structure from user applications. The application should not be involved with these issues since, conceptually, there is no difference in the operations carried out against the data. There are three types of data independence: Logical data independence: The ability to change the logical (conceptual) schema without changing the External schema (User View) is called logical data independence. For example, the addition or removal of new entities, attributes, or relationships to the conceptual schema or having to rewrite existing application programs. Physical data independence: The ability to change the physical schema without changing the logical schema is called physical data independence. For example, a change to the internal schema, such as using different file organization or storage structures, storage devices, or indexing strategy, should be possible without having to change the conceptual or external schemas. View level data independence: always independent no effect, because there doesn't exist any other level above view level. == Data independence == Data independence can be explained as follows: Each higher level of the data architecture is immune to changes of the next lower level of the architecture. The logical scheme stays unchanged even though the storage space or type of some data is changed for reasons of optimization or reorganization. In this, external schema does not change. In this, internal schema changes may be required due to some physical schema were reorganized here. Physical data independence is present in most databases and file environment in which hardware storage of encoding, exact location of data on disk, merging of records, so on this are hidden from user. == Data independence types == The ability to modify schema definition in one level without affecting schema of that definition in the next higher level is called data independence. There are two levels of data independence, they are Physical data independence and Logical data independence. Physical data independence is the ability to modify the physical schema without causing application programs to be rewritten. Modifications at the physical level are occasionally necessary to improve performance. It means we change the physical storage/level without affecting the conceptual or external view of the data. The new changes are absorbed by mapping techniques. Logical data independence is the ability to modify the logical schema without causing application programs to be rewritten. Modifications at the logical level are necessary whenever the logical structure of the database is altered (for example, when money-market accounts are added to banking system). Logical Data independence means if we add some new columns or remove some columns from table then the user view and programs should not change. For example: consider two users A & B. Both are selecting the fields "EmployeeNumber" and "EmployeeName". If user B adds a new column (e.g. salary) to his table, it will not affect the external view for user A, though the internal schema of the database has been changed for both users A & B. Logical data independence is more difficult to achieve than physical data independence, since application programs are heavily dependent on the logical structure of the data that they access.

Data preservation

Data preservation is the act of conserving and maintaining both the safety and integrity of data. Preservation is done through formal activities that are governed by policies, regulations and strategies directed towards protecting and prolonging the existence and authenticity of data and its metadata. Data can be described as the elements or units in which knowledge and information is created, and metadata are the summarizing subsets of the elements of data; or the data about the data. The main goal of data preservation is to protect data from being lost or destroyed and to contribute to the reuse and progression of the data. == History == Most historical data collected over time has been lost or destroyed. War and natural disasters combined with the lack of materials and necessary practices to preserve and protect data has caused this. Usually, only the most important data sets were saved, such as government records and statistics, legal contracts and economic transactions. Scientific research and doctoral theses data have mostly been destroyed from improper storage and lack of data preservation awareness and execution. Over time, data preservation has evolved and has generated importance and awareness. We now have many different ways to preserve data and many different important organizations involved in doing so. The first digital data preservation storage solutions appeared in the 1950s, which were usually flat or hierarchically structured. While there were still issues with these solutions, it made storing data much cheaper, and more easily accessible. In the 1970s relational databases as well as spreadsheets appeared. Relational data bases structure data into tables using structured query languages which made them more efficient than the preceding storage solutions, and spreadsheets hold high volumes of numeric data which can be applied to these relational databases to produce derivative data. More recently, non-relational (non-structured query language) databases have appeared as complements to relational databases which hold high volumes of unstructured or semi-structured data. == Importance == The scope of data preservation is vast. Everything from governmental to business records to art essentially can be represented as data, and is amenable to be lost. This then leads to loss of human history, for perpetuity. Data can be lost on a small or independent scale whether it's personal data loss, or data loss within businesses and organizations, as well as on a larger or national or global scale which can negatively and potentially permanently affect things such as environmental protection, medical research, homeland security, public health and safety, economic development and culture. The mechanisms of data loss are also as many as they are varied, spanning from disaster, wars, data breaches, negligence, all the way through simple forgetting to natural decay. Ways in which data collections can be used when preserved and stored properly can be seen through the U.S. Geological Survey, which stores data collections on natural hazards, natural resources, and landscapes. The data collected by the Survey is used by federal and state land management agencies towards land use planning and management, and continually needs access to historical reference data. == Related Concepts == In contrast, data holdings are collections of gathered data that are informally kept, and not necessarily prepared for long-term preservation. For example, a collection or back-up of personal files. Data holdings are generally the storage methods used in the past when data has been lost due to environmental and other historical disasters. Furthermore, data retention differs from data preservation in the sense that by definition, to retain an object (data) is to hold or keep possession or use of the object. To preserve an object is to protect, maintain and keep up for future use. Retention policies often circle around when data should be deleted on purpose as well, and held from public access, while preservation prioritizes permanence and more widely shared access. Thus, data preservation exceeds the concept of having or possessing data or back up copies of data. Data preservation ensures reliable access to data by including back-up and recovery mechanisms that precede the event of a disaster or technological change. == Methods == === Digital === Digital preservation, is similar to data preservation, but is mainly concerned with technological threats, and solely digital data. Essentially digital data is a set of formal activities to enable ongoing or persistent use and access of digital data exceeding the occurrence of technological malfunction or change. Digital preservation is aware of the inevitable change in technology and protocols, and prepares for data that will need to be accessible across new types of technologies and platforms while the integrity of the data and metadata are being conserved. Technology, while providing great process in conserving data that may not have been possible in the past, is also changing at such a quick rate that digital data may not be accessible anymore due to the format being incompatible with new software. Without the use of data preservation much of our existing digital data is at risk. The majority of methods used towards data preservation today are digital methods, which are so far the most effective methods that exist. === Archives === Archives are a collection of historical documents and records. Archives contribute and work towards the preservation of data by collecting data that is well organized, while providing the appropriate metadata to confirm it. An example of an important data archive is The LONI Image Data Archive, which is an archive that collects data regarding clinical trials and clinical research studies. === Catalogues, directories and portals === Catalogues, directories and portals are consolidated resources which are kept by individual institutions, and are associated with data archives and holdings. In other words, the data is not presented on the site, but instead might act as metadata and aggregators, and may administer thorough inventories. === Repositories === Repositories are places where data archives and holdings can be accessed and stored. The goal of repositories is to make sure that all requirements and protocols of archives and holdings are being met, and data is being certified to ensure data integrity and user trust. Single-site Repositories A repository that holds all data sets on a single site. An example of a major single-site repository the Data Archiving and Networking Services which is a repository which provides ongoing access to digital research resources for the Netherlands. Multi-Site Repositories A repository that hosts data set on multiple institutional sites. An example of a well known multi-site repository is OpenAIRE which is a repository that hosts research data and publications collaborating all of the EU countries and more. OpenAIRE promotes open scholarship and seeks to improves discover-ability and re-usability of data. Trusted Digital Repository A repository that seeks to provide reliable, trusted access over a long period of time. The repository can be single or multi-sited but must cooperate with the Reference Model for an Open Archival Information System, as well as adhere to a set of rules or attributes that contribute to its trust such as having persistent financial responsibility, organizational buoyancy, administrative responsibility security and safety. An example of a trusted digital repository is The Digital Repository of Ireland (DRI) which is a multi-site repository that hosts Ireland's humanity and social science data sets. === Cyber Infrastructures === Cyber infrastructures which consists of archive collections which are made available through the system of hardware, technologies, software, policies, services and tools. Cyber infrastructures are geared towards the sharing of data supporting peer-to-peer collaborations and a cultural community. An example of a major cyber-infrastructure is The Canadian Geo-spatial Data Infrastructure which provides access to spatial data in Canada.

Information security

Information security is the practice of protecting information by mitigating information risks. It is part of information risk management. It typically involves preventing or reducing the probability of unauthorized or inappropriate access to data or the unlawful use, disclosure, disruption, deletion, corruption, modification, inspection, recording, or devaluation of information. It also involves actions intended to reduce the adverse impacts of such incidents. Protected information may take any form, e.g., electronic or physical, tangible (e.g., paperwork), or intangible (e.g., knowledge). Information security's primary focus is the balanced protection of data confidentiality, integrity, and availability (known as the CIA triad, unrelated to the US government organization) while maintaining a focus on efficient policy implementation, all without hampering organization productivity. This is largely achieved through a structured risk management process. To standardize this discipline, academics and professionals collaborate to offer guidance, policies, and industry standards on passwords, antivirus software, firewalls, encryption software, legal liability, security awareness and training, and so forth. This standardization may be further driven by a wide variety of laws and regulations that affect how data is accessed, processed, stored, transferred, and destroyed. While paper-based business operations are still prevalent, requiring their own set of information security practices, enterprise digital initiatives are increasingly being emphasized, with information assurance now typically being dealt with by information technology (IT) security specialists. These specialists apply information security to technology (most often some form of computer system). IT security specialists are almost always found in any major enterprise/establishment due to the nature and value of the data within larger businesses. They are responsible for keeping all of the technology within the company secure from malicious attacks that often attempt to acquire critical private information or gain control of the internal systems. There are many specialist roles in Information Security including securing networks and allied infrastructure, securing applications and databases, security testing, information systems auditing, business continuity planning, electronic record discovery, and digital forensics. == Standards == Information security standards are guidelines generally outlined in published materials that aim to protect a user's or an organization's cyber environment from threats. This environment includes the users themselves, hardware such as devices and networks, software such as applications or services, and any information in storage or transit. These standards comprise security concepts, technologies, and guidelines to deal with an adverse event. They may also include assessment criteria and certification for organizations implementing a minimum level of security. These standards are developed by various international and national bodies to prevent or mitigate cyber-attacks, ensure consistency among developers, and establish a minimum standard in industries susceptible to an attack. The ISO/IEC 27000 family, published by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), provides information about the guidelines and requirements for an Information Security Management System (ISMS). The Common Criteria (ISO/IEC 15408) provides guidelines on evaluating and certifying the security of a system. The IEC 62443 establishes security standards for automation and control systems. Similarly, the ISO/SAE 21434, ETSI EN 303 645, and EN 18031 provide standards for road vehicles, the Internet of Things, and radio-based systems respectively. The NIST Cybersecurity Framework (NIST CSF) is a set of guidelines developed by the U.S. National Institute of Standards and Technology to help organizations with risk management. NIST also publishes various Federal Information Processing Standards (FIPS) and Special Publications. The United Kingdom has introduced Cyber Essentials, which is a certification scheme to protect organizations against common security threats. The Australian Cyber Security Centre publishes the Essential Eight mitigation strategies. The Payment Card Industry Data Security Standard (PCI DSS) regulates handling of cardholder data in order to reduce credit card fraud. UL has published standards related to specific industries such as UL 2900-2-3 for security and life safety signaling systems and UL-2900-2-1 for healthcare and wellness systems. == Threats == Information security threats come in many different forms. Some of the most common threats today are software attacks, theft of intellectual property, theft of identity, theft of equipment or information, sabotage, and information extortion. Viruses, worms, phishing attacks, and Trojan horses are a few common examples of software attacks. The theft of intellectual property has also been an extensive issue for many businesses. Identity theft is the attempt to act as someone else usually to obtain that person's personal information or to take advantage of their access to vital information through social engineering. Sabotage usually consists of the destruction of an organization's website in an attempt to cause loss of confidence on the part of its customers. Information extortion consists of theft of a company's property or information as an attempt to receive a payment in exchange for returning the information or property back to its owner, as with ransomware. One of the most functional precautions against these attacks is to conduct periodical user awareness. Governments, military, corporations, financial institutions, hospitals, non-profit organizations, and private businesses amass a great deal of confidential information about their employees, customers, products, research, and financial status. Should confidential information about a business's customers or finances or new product line fall into the hands of a competitor or hacker, a business and its customers could suffer widespread, irreparable financial loss, as well as damage to the company's reputation. From a business perspective, information security must be balanced against cost; the Gordon-Loeb Model provides a mathematical economic approach for addressing this concern. For the individual, information security has a significant effect on privacy, which is viewed very differently in various cultures. == History == Since the early days of communication, diplomats and military commanders understood that it was necessary to provide some mechanism to protect the confidentiality of correspondence and to have some means of detecting tampering. Julius Caesar is credited with the invention of the Caesar cipher c. 50 B.C., which was created in order to prevent his secret messages from being read should a message fall into the wrong hands. However, for the most part protection was achieved through the application of procedural handling controls. Sensitive information was marked up to indicate that it should be protected and transported by trusted persons, guarded and stored in a secure environment or strong box. As postal services expanded, governments created official organizations to intercept, decipher, read, and reseal letters (e.g., the U.K.'s Secret Office, founded in 1653). In the mid-nineteenth century more complex classification systems were developed to allow governments to manage their information according to the degree of sensitivity. For example, the British Government codified this, to some extent, with the publication of the Official Secrets Act in 1889. Section 1 of the law concerned espionage and unlawful disclosures of information, while Section 2 dealt with breaches of official trust. A public interest defense was soon added to defend disclosures in the interest of the state. A similar law was passed in India in 1889, The Indian Official Secrets Act, which was associated with the British colonial era and used to crack down on newspapers that opposed the Raj's policies. A newer version was passed in 1923 that extended to all matters of confidential or secret information for governance. By the time of the First World War, multi-tier classification systems were used to communicate information to and from various fronts, which encouraged greater use of code making and breaking sections in diplomatic and military headquarters. Encoding became more sophisticated between the wars as machines were employed to scramble and unscramble information. The establishment of computer security inaugurated the history of information security. The need for such appeared during World War II. The volume of information shared by the Allied countries during the Second World War necessitated formal alignment of classification systems and procedural controls. An arcane range of markings evol

Source-code editor

A source-code editor is a text editor program designed specifically for editing the source code of computer programs. It includes basic functionality such as syntax highlighting, and sometimes debugging. It may be a standalone application or it may be built into an integrated development environment (IDE). == Features == Source-code editors have features specifically designed to simplify and speed up typing of source code, such as syntax highlighting(syntax error highlighting), auto indentation, autocomplete and brace matching functionality. These editors may also provide a convenient way to run a compiler, interpreter, debugger, or other program relevant for the software-development process. While many text editors like Notepad can be used to edit source code, if they do not enhance, automate or ease the editing of code, they are not defined as source-code editors. Structure editors are a different form of a source-code editor, where instead of editing raw text, one manipulates the code's structure, generally the abstract syntax tree. In this case features such as syntax highlighting, validation, and code formatting are easily and efficiently implemented from the concrete syntax tree or abstract syntax tree, but editing is often more rigid than free-form text. Structure editors also require extensive support for each language, and thus are harder to extend to new languages than text editors, where basic support only requires supporting syntax highlighting or indentation. For this reason, strict structure editors are not popular for source code editing, though some IDEs provide similar functionality. A source-code editor can check syntax dynamically while code is being entered and immediately warn of syntax problems, as well as suggest code autocomplete snippets. A few source-code editors compress source code, typically converting common keywords into single-byte tokens, removing unnecessary whitespace, and converting numbers to a binary form. Such tokenizing editors later uncompress the source code when viewing it, possibly prettyprinting it with consistent capitalization and spacing. A few source-code editors do both. The Language Server Protocol, first used in Microsoft's Visual Studio Code, allows for source code editors to implement an LSP client that can read syntax information about any language with a LSP server. This allows for source code editors to easily support more languages with syntax highlighting, refactoring, and reference finding. Many source code editors such as Neovim and Brackets have added a built-in LSP client while other editors such as Emacs, Vim, and Sublime Text have support for an LSP Client via a separate plug-in. == History == In 1985, Mike Cowlishaw of IBM created LEXX while seconded to the Oxford University Press. LEXX used live parsing and used color and fonts for syntax highlighting. IBM's LPEX (Live Parsing Extensible Editor) was based on LEXX and ran on VM/CMS, OS/2, OS/400, Windows, and Java Although the initial public release of vim was in 1991, the syntax highlighting feature was not introduced until version 5.0 in 1998. On November 1, 2015, the first version of NeoVim was released. In 2003, Notepad++, a source code editor for Windows, was released by Don Ho. The intention was to create an alternative to the java-based source code editor, JEXT In 2015, Microsoft released Visual Studio Code as a lightweight and cross-platform alternative to their Visual Studio IDE. The following year, Visual Studio Code became the Microsoft product using the Language Server Protocol. This code editor quickly gained popularity and emerged as the most widely used source code editor. == Comparison with IDEs == A source-code editor is one component of a Integrated Development Environment. In contrast to a standalone source-code editor, an IDE typically also includes several tools which enhance the software development process. Such tools include syntax highlighting, code autocomplete suggestions, version control, automatic formatting, integrated runtime environments, debugger, and build tools. Standalone source code editors are preferred over IDEs by some developers when they believe the IDEs are bloated with features they do not need. == Notable examples == == Controversy == Many source-code editors and IDEs have been involved in ongoing user arguments, sometimes referred to jovially as "holy wars" by the programming community. Notable examples include vi vs. Emacs and Eclipse vs. NetBeans. These arguments have formed a significant part of internet culture and they often start whenever either editor is mentioned anywhere.

Media contacts database

In public relations (PR) and marketing, a media contacts database is a resource which catalogs the names, contact information, and other details about people who work in various media professions. These include journalists, reporters, editors, publishers, contributors, freelance journalists, opinion writers, social media personalities/ influencers, TV show anchors, radio show hosts, DJs, and others. A media contacts database usually contains the following information: Full name of the media contact, The publication or channel they work for Designations (past and present) Topics they cover, or their beat Contact information found in public domains Online presence like blogs and other social networking sites Education Information == Overview == A media contacts database is a public relations tool that is maintained and used by PR professionals to pitch stories on a particular topic, product, or company to a specific group of journalists. These journalists would then write or speak about the particular topic in a relevant issue or episode of their shows. A media contacts database allows a PR professional to gain easy access to hundreds of journalists within a short span of time. Media contacts database are created and sold by many media research companies that offer such PR software for professionals.