The Risks of Using OpenAI

I never imagined that the first blog post in a long time wouldn’t be about my research but about recent events that are impeding my completion of what’s been a very long development cycle. In all of the discussions of AI with which I’m bombarded daily, I haven’t seen anyone talk about this risk: the volatility of the companies themselves.

I’ve been using OpenAI’s products for several years now. They’ve become an integral part of my research, and while I have also used several other such models, I tend to come back to OpenAI repeatedly. I pay for the top tier of ChatGPT (which is a solid research tool), and I’ve been using API access as a critical component of what drives my research (even though AI is not my research, it is the tool that demonstrates why what I’ve been doing matters).

Two days ago, OpenAI changed API access to a number of their models to remove certain functionality unless you go through a “verification” process. I admit, I’ve seldom had one of these automated processes go well, but in this case OpenAI has done the worst job I could imagine. I used my driving licence, and it then wanted me to take pictures of my face with my phone. Whoever thought that asking you to push a button on the screen while you’re looking away from the screen was a good idea is probably a graduate of the Marquis de Sade school of UX/UI design. Or they’re really systems software developers (who believe a CLI is “all anyone needs”).

Don’t get me wrong, I chuckle watching these coding editors use “sed,” as most developers now don’t know why it’s called sed (ahem), let alone have used ed (the editor from UNIX – it was certainly there in V6, but from what I can tell it went away at some point). Aside: the memories of using ed to edit files on a hard TTY (greenbar tractor-feed paper, with a keyboard) are still lurking in the recesses of my memory.

Turns out the verification “failed” due to some sort of technical issue, so the third-party provider working on OpenAI’s behalf said “you’ll have to get a new verification link.”

I was eventually able to find a chat interface to OpenAI’s support (their Discord has a “bug reports” channel, but I’m not allowed to post to it). The answer from support was:

If your verification failed, you won’t be able to access certain features—like GPT-4o and other O Series models—via the API. These models require a verified organization to use. For more details, please refer to our article here: API organization verification.

Regards,
Sneha J
OpenAI Support

Hence, a very real risk of relying upon any AI supplier now is that they’re inherently unstable. I read that this new verification was added to prevent companies like DeepSeek from using OpenAI’s models to train their own. Lesson: OpenAI (and likely all of the providers) are risky to rely upon because they can quickly change direction and leave you without any resolution.

I decided to dig more (which is not what I should be doing, but this one really has me annoyed, partly because I really do not like the perspective of companies that seem to think “this would be a great business if it weren’t for the customers”). Naturally, I decided to turn to ChatGPT. I’ll link the entire conversation (in case you want to dig into it), but for this post I’ll summarize my findings about these sorts of “identity verification services”. Note that I did this as a “deep research” request.

The response is long. I’ll summarize it here and include the full text following.

Summary (provided via ChatGPT, reviewed by me, not modified):

Camera-based identity verification systems (such as those offered by Persona, Onfido, Jumio, and ID.me) have made significant progress, achieving high accuracy—often exceeding 99%—in controlled conditions, according to industry benchmarks and third-party evaluations. However, real-world results are more variable, with legitimate users frequently facing verification failures due to common practical issues such as poor lighting, unclear photos, device limitations, or overly strict one-time verification attempts. These problems disproportionately affect certain demographics and older devices. While top-tier providers implement sophisticated biometric comparisons and liveness detection to reduce errors, the best user experiences combine automated checks with manual fallback processes and reasonable retry policies, highlighting that reliability depends as much on thoughtful implementation and fairness as on technical accuracy.

Query and Response

Here’s a public link to the entire conversation:

I’ll research the accuracy and reliability of identity verification services that use camera-based biometric comparison and document scanning—both generally across the industry and specifically for withpersona.com, if data is available.

I’ll also look for:

  • Reported accuracy metrics (false positives/negatives)
  • Common failure scenarios (e.g., document positioning, lighting, chip scanning)
  • Any published benchmarks, user feedback, or audits
  • Regulatory commentary or standards (e.g., NIST guidelines)

I’ll let you know what I find!

Accuracy and Reliability of Camera-Based Identity Verification

Overview of Camera-Based Identity Verification

Camera-based third-party identity verification (IDV) typically involves two key steps: document verification (scanning a photo ID for authenticity) and selfie biometrics (capturing a live facial image to compare with the ID photo). This process is widely used for remote Know Your Customer (KYC) and user onboarding. Modern services combine optical checks (OCR of text, security feature detection, and even RFID chip reading on e-passports) with facial recognition (matching the selfie to the ID photo) and often include liveness detection to ensure the person is real and present. The goal is to accurately approve legitimate users while catching impostors, which is measured by metrics like false acceptance and rejection rates, as described below.

Key Accuracy Metrics and Industry Benchmarks

False Acceptance Rate (FAR) and False Rejection Rate (FRR) are the standard metrics for accuracy. FAR is the percentage of fraudulent or mismatched users that are erroneously accepted (i.e. impostors slipped through), while FRR is the percentage of legitimate users incorrectly rejected (false negatives). An ideal system has both FAR and FRR as low as possible, but in practice there is a trade-off – tightening thresholds to reduce FAR often raises FRR. Many providers report confidence scores for matches, and clients choose a threshold (e.g. require similarity score above X% to pass).
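(An aside from me, not part of the ChatGPT response: here is a toy illustration of the FAR/FRR trade-off described above. The similarity scores are made up; the only point is that moving the acceptance threshold trades one error type for the other.)

```csharp
// Toy illustration of the FAR/FRR trade-off: the same similarity scores produce
// different error rates depending on where the acceptance threshold is set.
using System;
using System.Linq;

class FarFrrDemo
{
    static void Main()
    {
        // Hypothetical similarity scores (0..1) for genuine users and impostors.
        double[] genuine  = { 0.92, 0.88, 0.95, 0.60, 0.81, 0.77 };
        double[] impostor = { 0.30, 0.55, 0.48, 0.71, 0.25, 0.40 };

        foreach (var threshold in new[] { 0.5, 0.7, 0.9 })
        {
            // FRR: fraction of genuine users who fall below the threshold (false rejects).
            double frr = genuine.Count(s => s < threshold) / (double)genuine.Length;
            // FAR: fraction of impostors who score at or above the threshold (false accepts).
            double far = impostor.Count(s => s >= threshold) / (double)impostor.Length;
            Console.WriteLine($"threshold={threshold:0.0}  FAR={far:P0}  FRR={frr:P0}");
        }
    }
}
```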

Vendor-Reported Performance: Top vendors claim extremely high accuracy. For example, Onfido (a major IDV provider) announced that with its latest AI system (“Motion” with Atlas AI), false rejection and false acceptance rates are both below 0.1% (Onfido launches the next generation of facial biometric technology | Onfido). Onfido also reports improving its false acceptance rate 10× to ~0.01% on average after bias-mitigation efforts (Onfido’s Real Identity Platform Improves Performance by 12x | Onfido). Jumio, another leading provider, integrates iProov’s liveness technology; iProov is known for “industry-leading accuracy and low false rejection rates” that yield high user pass rates (Jumio Adds iProov’s Award-Winning Liveness Detection to its KYX Platform | iProov). Socure (which offers a combined document and data-driven verification) claims up to 99% identity verification success rates for mainstream user populations (Socure Verifies Over 2.7 Billion Identity Requests in 2024, Achieves …). Persona (withpersona.com) has not publicly disclosed specific error rates, but its platform is certified by independent labs – their selfie liveness passed rigorous iBeta Level-2 testing with 0% spoof acceptance (APCER) () and was evaluated by NIST and DHS, indicating high accuracy. Persona emphasizes that its models showed “no material bias across age, sex, or skin tone” in internal and third-party tests (Industry-Leading, Lab-Certified Face Recognition | Persona).

Independent Benchmark Results: Neutral evaluations reveal a wider range of performance across the industry:

Summary of Provider Accuracy (Approximate):

  • Onfido (Motion platform) – FRR: < 0.1% (vendor reported); FAR: < 0.1% (vendor reported); notable credentials: vendor-reported figures (Onfido launches the next generation of facial biometric technology | Onfido).
  • Persona (withpersona.com) – FRR: not published (estimated high-90s% pass rate); FAR: not published (threshold adjustable); notable credentials: iBeta Level-2 liveness passed (0% spoof success); NIST-tested face match; in-house bias testing shows no material bias (Industry-Leading, Lab-Certified Face Recognition | Persona).
  • Jumio (with iProov liveness) – FRR: ~1% or lower (implied); FAR: very low (implied); notable credentials: uses iProov Genuine Presence Assurance; known for high conversion and “low false rejection” (Jumio Adds iProov’s Award-Winning Liveness Detection to its KYX Platform | iProov).
  • ID.me (US govt-focused) – FRR: ~10–15% (automated process failure); FAR: N/A (fallback to manual) (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop).

Sources: Onfido and Jumio figures from vendor press releases; ID.me figure from House Oversight Committee findings (ID.me disclosed ~10–15% of users could not be verified via automated selfie match) (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop). Persona figures from lab certifications and Persona’s disclosures.

Common Causes of Failures and False Rejections

Even with advanced algorithms, failed verifications happen for a variety of practical reasons:

  • User Error & Poor Image Quality: The most common cause of failure is low-quality images provided by users (Buyer’s Guide to Identity Verification Solutions | Persona). If the camera capture is blurry, too small, or has glare, the system may be unable to read the ID or identify the face. Persona notes that “one of the most common reasons users fail verification checks is because their photos aren’t detailed enough for the tool to read.” (Buyer’s Guide to Identity Verification Solutions | Persona) This can be due to a dirty camera lens, low lighting, or motion blur, which make the ID text or face illegible. In such cases, the document or selfie will be rejected as “not legible.” (Complete Guide to Document Verification: Process, Benefits & Compliance | Persona) Many systems try to guide users (e.g. on-screen prompts to retake a blurry photo), but if the user does not or cannot capture a clear image, false rejections rise.
  • Document Issues – Wear, Expiry, or Mismatch: Physical IDs that are damaged, worn, or have faint print can cause the automated checks to fail. An ID with worn-out security features or a scratched photo might not pass authenticity verification. Expired documents are usually rejected by policy (Complete Guide to Document Verification: Process, Benefits & Compliance | Persona). Additionally, if the document data doesn’t match the user’s input data (name, DOB, etc.), the system flags a problem (Complete Guide to Document Verification: Process, Benefits & Compliance | Persona). These lead to legitimate users being flagged if, for example, they changed their name or input a nickname that doesn’t exactly match their ID. (Persona’s guide gives the example that a legal name change not reflected across documents can trigger a false negative (Complete Guide to Document Verification: Process, Benefits & Compliance | Persona).)
  • Lighting and Environment: Good lighting and camera focus are critical. Backlighting or extreme glare on an ID can foil OCR and face detection. Conversely, very low light or shadows can obscure a face. Many mobile verification SDKs now include auto-capture and exposure adjustment – e.g. waiting until the ID is steady and focused before snapping, or asking the user to move to a brighter area. Still, in real-world use, users may attempt verification in suboptimal conditions (nighttime indoor lighting, etc.) leading to higher failure rates.
  • Hardware and Device Variation: The type of camera or phone used can significantly affect reliability. Higher-end smartphones with good cameras tend to produce better results. The DHS RIVTD tests found performance differences across phone models were so large that they specifically recommend vendors ensure broad device compatibility (Understanding the results of DHS S&T’s RIVTD biometrics assessment | Biometric Update) (Understanding the results of DHS S&T’s RIVTD biometrics assessment | Biometric Update). Older phones may have low resolution or poor autofocus. Additionally, if using NFC to read a passport’s RFID chip, not all phones have NFC readers, and even when they do, aligning the chip correctly can be tricky for users (leading to failure to acquire the chip data). Some providers report Failure to Acquire rates as part of their metrics for chip reading. For example, FIDO Alliance has testing concepts for Document Failure-to-Acquire Rate (when the system can’t even read/scan the document) (Document Authenticity Verification Requirements – FIDO Alliance) – a high failure-to-acquire can translate to user friction. Browser or OS issues can also interfere (e.g. a user not granting camera permission, outdated browser not supporting the video feed, etc., will prevent the capture altogether).
  • Liveness and Anti-Spoof Sensitivity: Liveness detection adds another potential point of failure. Systems must balance security vs. convenience. If liveness checks are too strict, they might falsely flag genuine users (e.g. unusual lighting might be misinterpreted as a mask or screen spoof). For instance, some face liveness AI might mis-read glasses glare as a spoof attempt or think a low-resolution selfie is a “replay” attack (What is Selfie Identity Verification? | Persona). Persona cautions that even advanced liveness can produce “false negatives” – e.g. “someone’s eyeglasses may fool the system… or a low-resolution photo may trick the system into thinking it’s a digital replay” (What is Selfie Identity Verification? | Persona). In those cases, a real user could be wrongly rejected. The DHS active liveness results (41–94% genuine pass rate) underline how some implementations are much more user-friendly than others (Understanding the results of DHS S&T’s RIVTD biometrics assessment | Biometric Update). Generally, active liveness (asking the user to turn their head, blink, etc.) provides strong security but can fail if the user doesn’t follow instructions correctly or has mobility issues. Passive liveness (just analyzing a static selfie or short video) is easier on the user but technically challenging; the best passive systems achieve low errors, but others might either mistakenly reject users or allow spoofs if not robust (Understanding the results of DHS S&T’s RIVTD biometrics assessment | Biometric Update).
  • One-Time Attempts and Rigid Processes: Some verification flows only give users a single attempt or a very limited number of retries, which can lead to high drop-off if that attempt fails. For security, a business might choose to lock the verification after one failed try to prevent attackers from brute-forcing the process. However, this “one-and-done” policy can be harsh on legitimate users who made a mistake. Real-world user reports illustrate this pain point: for example, during state unemployment verifications with ID.me, some users were only allowed one automated attempt and then had to wait in a long queue for a video call after a failure (Three Key Problems with the Government’s Use of a Flawed Facial Recognition Service | ACLU). If no alternative path is provided, a single camera glitch can force the user out of the fast lane entirely. User experience reports on forums frequently cite frustration when “the system won’t accept my ID” and there’s no clear recourse – e.g., LinkedIn’s Persona-based verification initially gave some users trouble with few retry options or support channels (Linkedin and Persona : r/privacy – Reddit). A balanced approach is to allow a couple of retakes for issues like glare, while still halting endless tries for security.

Provider Comparison and Approaches

The IDV industry has several major providers – including Persona, Jumio, Onfido, and ID.me (mentioned by name), as well as others like Veriff, Socure, LexisNexis, Incode, etc. – each with different strengths. Below is a brief comparison of how they stack up on accuracy and reliability, and their known approaches:

  • Persona (withpersona.com): Persona offers an integrated, multi-layer verification with document scanning, face recognition, database checks, etc. Rather than disclosing a single “accuracy” percentage, Persona highlights its ensemble of models and independent certifications. According to Persona, their face matching and liveness AI were tested by NIST and DHS and showed “defense-grade” results (Industry-Leading, Lab-Certified Face Recognition | Persona). They achieved iBeta Level 2 compliance (meaning their liveness detection blocked 100% of spoof attempts in a certified lab test) (). Persona also focuses on bias mitigation – they’ve written about measuring and reducing demographic bias in face verification, and claim their models show no significant performance gaps across different ages, genders, or skin tones (Industry-Leading, Lab-Certified Face Recognition | Persona). In practice, Persona’s clients can configure the strictness of checks (e.g. what confidence score to accept, whether to require NFC chip verification, etc.). This flexibility allows tuning fraud vs. conversion trade-offs. Persona emphasizes conversion and user experience (auto-capturing images at the right moment, providing real-time feedback to correct issues) to reduce accidental failures. For example, their SDK provides “dynamic error handling” and tips to improve image quality (Industry-Leading, Lab-Certified Face Recognition | Persona). Real-world use: Persona is used by companies like cryptocurrency exchanges, fintech apps, and even LinkedIn for ID verification. Users have reported occasional friction (such as needing multiple tries to get a clear photo), but Persona’s system typically does allow retry attempts and even manual review options if automated steps fail (Persona ID verification : r/linkedin – Reddit). Overall, Persona’s accuracy appears on par with top-tier vendors, given its certifications, though exact FAR/FRR stats aren’t public.
  • Onfido: Onfido is a well-established provider that has invested heavily in AI (their “Atlas” AI engine) and has a large global customer base. Onfido’s published metrics are impressive: their system is 95% automated, with most checks done in seconds (Onfido’s Real Identity Platform Improves Performance by 12x | Onfido). They report <0.1% FAR and FRR under optimal settings (Onfido launches the next generation of facial biometric technology | Onfido), and have demonstrated improvements in reducing bias (a 10× reduction in false accept disparities) (How Onfido mitigates AI bias in facial recognition). Onfido’s liveness and face match are also iBeta Level 2 certified and were audited for fairness by the UK’s ICO (Onfido’s Real Identity Platform Improves Performance by 12x | Onfido). One notable aspect is Onfido’s “Motion” active liveness, which has the user turn their head in a short video; this was introduced to combat deepfakes and 3D mask attacks, and was found compliant with the stringent ISO 30107-3 standard (Onfido launches the next generation of facial biometric technology | Onfido) (Onfido launches the next generation of facial biometric technology | Onfido). In customer deployments, Onfido often balances auto-approval with fallback to manual review for edge cases. They promote a “hybrid” approach where AI handles the bulk of verifications and questionable cases get human review – this helps keep false rejects low without letting fraud through. Onfido’s scale (tens of millions of verifications per year) suggests their reported error rates are averaged over many scenarios; certain populations or documents might see higher friction, but they continuously retrain on new data. In sum, Onfido is viewed as having very high accuracy (enterprise-grade) and is frequently benchmarked in analyst reports. In Gartner’s 2024 Critical Capabilities report, for instance, Onfido and Persona both ranked highly (Persona was noted as a top performer across use cases) in part due to their accuracy and flexibility.
  • Jumio: Jumio is another leading vendor known for a broad identity platform. Historically, Jumio’s selfie verification had solid accuracy, and in 2021 they partnered with iProov to further enhance liveness and face match performance (Jumio Adds iProov’s Award-Winning Liveness Detection to its KYX Platform | iProov) (Jumio Adds iProov’s Award-Winning Liveness Detection to its KYX Platform | iProov). iProov’s system (which uses a brief face scan with illuminated colors) is designed for inclusivity and high pass rates – it’s used by government agencies like DHS and the UK Home Office (Jumio Adds iProov’s Award-Winning Liveness Detection to its KYX Platform | iProov). This indicates strong performance under strict testing. While Jumio hasn’t publicly quoted specific error rates recently, the integration of iProov suggests false rejection rates well under 1-2% in practice and excellent spoof resistance. (iProov’s own tests with governments showed near-perfect spoof blocking and 99%+ completion rates, even for older users or those less tech-savvy (Jumio Adds iProov’s Award-Winning Liveness Detection to its KYX Platform | iProov).) Jumio’s document verification is also respected; they were an early mover in ID authenticity checks and likely participated in the DHS or other evaluations (though results were anonymized). A U.S. GSA privacy assessment in 2023 listed Jumio among vendors to be studied for equity (GSA testing finds variations in the accuracy of digital ID verification tech – Nextgov/FCW). Jumio also supports NFC scanning for passports and even Face Match against chip data – using the high-resolution photo in the RFID chip to compare with the selfie, which can improve match accuracy when available. Overall, Jumio’s strategy is to offer high-security, compliance-focused solutions (they often highlight GDPR compliance and data security), while leveraging top-tier biometric tech for reliability. Clients of Jumio (banks, airlines, etc.) often report good verification rates, but like others, issues can arise with users on old devices or unfamiliar with the process (which Jumio mitigates with UI guidance and their “Netverify” SDK’s automatic capture features).
  • ID.me: ID.me is somewhat unique as it has been heavily used by government agencies in the US (e.g. IRS, state unemployment systems) and has a mixed reputation in terms of user experience. On the one hand, ID.me’s identity verification has caught a huge amount of fraud during the pandemic (they claim to have blocked substantial identity theft attempts, including by requiring selfies where criminals had only stolen documents). On the other hand, ID.me’s automated face match has a relatively high false rejection rate, requiring many users to undergo manual video calls. In testimony to Congress, ID.me revealed that 10–15% of users could not be verified through the automated selfie-matching process (Chairs Maloney, Clyburn Release Evidence Facial Recognition …). In real numbers, ID.me has stated that only ~70-85% of people complete verification self-serve for certain programs, and the rest need human intervention (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop). Those who fail initially must join a video chat with a “Trusted Referee,” which led to backlogs — there were reports of users waiting hours or even being “booted out” of virtual queues due to technical difficulties (Three Key Problems with the Government’s Use of a Flawed Facial Recognition Service | ACLU). The high FRR can be attributed to a combination of factors: many users verifying were not in ideal conditions (some had limited broadband or older devices), and ID.me’s system settings erred on the side of fraud prevention, possibly with stricter thresholds. ID.me’s CEO has claimed their face match algorithm (provided by Paravision) is very accurate in lab terms – he cited false match error rates “as low as less than 1%, with insignificant variation across race/sex”, for 1:1 matching (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop). However, real-world conditions (aging of ID photos, low-quality selfies, user errors) meant about a 10-15% false non-match rate in practice. ID.me also controversially was using a 1:many face search (comparing the selfie against a larger database to prevent duplicate identities) which is generally less accurate and raised privacy concerns (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop). They have since downplayed this 1:many usage after backlash. From an industry perspective, the ID.me case underscores that even if an algorithm is top-tier, operational decisions (like offering no in-person alternative, or allowing only one try) can impact effective reliability. Regulators and advocacy groups (ACLU, etc.) noted ID.me’s system was “nearly universally reviled by users for its poor service and difficult verification process.” (Three Key Problems with the Government’s Use of a Flawed Facial Recognition Service | ACLU) The company has responded by increasing its support staff and claiming to have cut video call wait times by 86% and average waits under 10 minutes (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop). Going forward, the federal government (Login.gov) is exploring other solutions, emphasizing that any chosen system must be equitable and highly accurate for all users.
  • Other Notable Providers: Socure Verify, Onfido, Incode, LexisNexis ThreatMetrix, TransUnion, Veriff, Microsoft (Azure AD Verify), Google (Cloud Identity Toolkit), etc., all offer ID verification services with broadly similar technology. Many have published case studies or white papers with glowing statistics, but fewer have third-party audits available. For example, Socure asserts that its AI-based approach (which combines document verification with extensive data source cross-checks) achieves +8-10% higher verification rates for “hard-to-identify” demographics compared to competitors (Socure Launches Compliance Product Suite to Optimize ID …) – indicating a focus on maximizing inclusivity. Incode and HyperVerge have boasted about meeting all benchmarks in the DHS tests (Understanding the results of DHS S&T’s RIVTD biometrics assessment | Biometric Update), suggesting top-tier accuracy. What distinguishes providers often is the workflow flexibility and fallback procedures: e.g., some offer integrated manual review services or allow additional identity evidence (like a second ID, a utility bill, or knowledge-based questions) if the primary check fails. These mitigations can raise overall success rates. Providers also differentiate with geographical coverage (ability to recognize IDs from many countries), compliance certifications (GDPR, SOC 2, ISO27001), and whether they keep data onshore for certain jurisdictions. All these factors play into reliability – e.g., an IDV service that can read the MRZ on an international passport and verify its chip will be more reliable for foreign users than one that only knows US IDs.

Regulatory Standards and Evaluations

Given the critical role of ID verification in security and access, standards and regulations have emerged to guide accuracy and fairness requirements:

  • NIST SP 800-63: The National Institute of Standards and Technology’s Special Publication 800-63 (in particular 800-63A) provides a framework for digital identity proofing at various Identity Assurance Levels (IAL). For remote ID verification (IAL2/IAL3), NIST recommends the use of document authentication plus biometric comparison. While SP 800-63-3 (current as of 2017) doesn’t mandate specific error rates, the draft SP 800-63-4 is expected to incorporate findings from recent evaluations. As noted, DHS’s benchmark suggested selecting systems with <10% document and biometric error rates (Understanding the results of DHS S&T’s RIVTD biometrics assessment | Biometric Update), so future guidance may explicitly call for vendors meeting that bar. NIST also runs the Face Recognition Vendor Tests (FRVT) – an ongoing benchmark of face matching algorithms. Many vendors (or the algorithm suppliers they use) participate, and top algorithms in 1:1 verification now achieve extremely low error rates (FNMR well below 0.5% at FAR=1e-6 in some cases). However, these FRVT tests use high-quality images; NIST acknowledges that real-world performance will be worse due to capture issues. Still, agencies refer to FRVT rankings when vetting technology. Additionally, NIST and the GSA have placed huge emphasis on demographic bias testing, as exemplified by the GSA’s study ([2409.12318] A large-scale study of performance and equity of commercial remote identity verification technologies across demographics). Any vendor selling to government is under pressure to demonstrate that their false match/reject rates do not disproportionately impact any race, gender, or age group. This has led to internal testing and improvements – e.g., Onfido and Microsoft both have published methodologies for reducing bias in face AI (How Onfido mitigates AI bias in facial recognition).
  • GDPR and Data Protection: In the EU (and other jurisdictions with similar privacy laws), the use of facial recognition and biometrics for identity verification must comply with GDPR. Biometrics are considered “special category” personal data under GDPR, requiring explicit user consent (or another narrow legal basis) and subject to strict security and minimization requirements. IDV providers usually obtain the user’s consent to process their ID photo and selfie for verification purposes. They also have to handle data retention carefully – many offer options to auto-delete biometric data after verification or store it only in a hashed form, to alleviate privacy concerns. For example, Persona’s policy lets business clients configure how long data is stored, to help them meet regional privacy rules. Providers targeting Europe often undergo third-party audits and certify to standards like ISO 27001 or SOC2, and some join the EU-U.S. Data Privacy Framework to lawfully transfer data. Another aspect is GDPR’s accuracy principle – organizations processing personal data must ensure it’s accurate and up-to-date. For IDV, this can be interpreted as a need to ensure the verification results are correct (to not wrongly deny someone access due to a false negative). In practice, a false rejection might be seen as an “inaccuracy” in personal data processing. While not usually litigated, it’s something companies pay attention to in order to avoid claims of algorithmic discrimination under GDPR or related laws.
  • Certification Schemes: Beyond NIST and GDPR, there are industry certifications. We’ve mentioned iBeta PAD certification (which is essentially required by many banks/fintechs to ensure liveness spoof resilience at ISO 30107 Levels 1 or 2). Many providers proudly cite passing iBeta Level 1 or 2 (Onfido, Persona, Facetec, iProov, etc. all have). There’s also the UK Digital Identity and Attributes Trust Framework (DIATF), which certifies IDV providers for use in verifying identities in the UK – companies like Jumio, Onfido, and Yoti have been certified, which involves meeting performance and security benchmarks. Similarly, in Canada, the DIACC’s trust framework and in Australia the “TDIF” set requirements for biometric accuracy (often referencing back to NIST FRVT results or ISO standards). NIST 800-63-3 at IAL2 effectively requires agencies to use services that have demonstrable equivalent assurance to an in-person check of photo ID; this has driven agencies to demand evidence from vendors (test results, audits).
  • Audit and Transparency: Some providers have undergone independent audits or published white papers with performance data. For instance, Onfido published a white paper on reducing bias with detailed breakdowns of false acceptance rates by demographic after various training interventions (How Onfido mitigates AI bias in facial recognition). Microsoft’s Face API team similarly published data on how they reduced error rate disparities. These are positive steps, but not all vendors share such detail publicly. The U.S. Government (GSA) study on equity (GSA testing finds variations in the accuracy of digital ID verification tech – Nextgov/FCW), once finalized in 2025, will likely shine light on each tested vendor’s strengths/weaknesses (if vendors consent to be named), which could push the industry toward more transparency. In Europe, the proposed EU AI Act could classify “remote biometric identification” systems as high-risk, meaning providers might have to undergo conformity assessments and provide documentation on accuracy, testing, and bias mitigation as a legal requirement.

Real-World User Experiences and Limitations

While metrics and certifications tell one side of the story, user experience in the wild often uncovers limitations. Identity verification, when it works seamlessly, barely gets noticed – but when it fails, users can be very vocal. Here are some real-world insights:

  • Demographic Disparities: As noted, certain groups have historically faced higher error rates in face matching. Older adults sometimes struggle with the selfie step (they may have more trouble aligning their face or may present an appearance that differs significantly from their ID photo taken years earlier). People with very dark skin tones have been shown in some studies to experience higher false rejection in facial recognition systems that were not properly trained – the GSA study confirmed one vendor had this issue, rejecting a disproportionate number of Black users ([2409.12318] A large-scale study of performance and equity of commercial remote identity verification technologies across demographics). This not only frustrates users but can deny access to essential services. In response, companies are diversifying training data and testing. For example, Microsoft and FaceTec both improved their algorithms after early bias critiques. Persona explicitly mentions using ethically sourced, diverse data and testing for bias (Industry-Leading, Lab-Certified Face Recognition | Persona). Still, users occasionally report anecdotes like “I had to try multiple times, but my lighter-skinned friend got through on first try” – individual experiences vary, and perception of bias can harm trust even if unintentional.
  • One-Time/One-Channel Verification: Some implementations (especially in government or high-security contexts) give users no fallback options – e.g., no alternative to doing the selfie. The ACLU criticized systems that “don’t provide an accessible offline alternative”, noting that forcing everyone through a selfie upload can exclude those without smartphones or with disabilities (Three Key Problems with the Government’s Use of a Flawed Facial Recognition Service | ACLU) (Three Key Problems with the Government’s Use of a Flawed Facial Recognition Service | ACLU). A harsh reality was during COVID, unemployment claimants who had poor internet or no webcam simply had no way to verify when states only offered the online ID.me route (Three Key Problems with the Government’s Use of a Flawed Facial Recognition Service | ACLU). This taught agencies that having alternative verification methods (in-person, mail-in, or at least video chat on low bandwidth) is important for equity. From a user’s view, a reliable system isn’t just one that’s accurate when working, but one that offers help when it fails. Many vendors now offer omni-channel support: e.g., some partner with networks of retail locations where a user can show their ID to a clerk as a backup, or they offer postal verification. These aren’t camera-based, but they improve overall reliability of the identity proofing process.
  • Strict Retry Policies: As mentioned, a single failed attempt can put a user in “identity verification limbo.” Some exchanges or apps allow only one submission of documents to prevent fraudsters from trial-and-error. But genuine users also get only one shot – if their camera glitched or their hands shook, they might be locked out. Users have complained on forums about scenarios like being banned from a platform because the ID verification failed once and there was no second chance. Good practice is to allow at least a small number of retries (since most failures are benign issues like blur). Manual review is the ultimate fallback: companies like Persona and Onfido offer services or tools for a human agent to review the documents and selfie if automation can’t make a definitive decision. While manual review is slower (minutes or hours instead of seconds) and costlier, it dramatically increases overall success rates by rescuing false rejects. For example, one fintech noted that adding a human-overread for failed automated checks raised their total verification pass rate several percentage points and saved many customer relationships (Jumio Adds iProov’s Award-Winning Liveness Detection to its KYX Platform | iProov) (since those users would have been denied by AI alone). However, not all companies utilize this – some low-cost providers or strict compliance scenarios simply reject and require the user to contact support. The user experience in those cases can be painful.
  • Harsh Implementation Stories: A notable case was the IRS’s attempted rollout of mandatory ID.me in 2022. Taxpayers were alarmed at having to submit selfies, and reports surfaced of people unable to verify in time to meet filing deadlines. Under public pressure, the IRS dropped the requirement (though it still offers ID.me as an option) (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop) (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop). This showed that public acceptance of facial verification is still tenuous if people feel it’s not reliable or private enough. In contrast, when private-sector users perceive a clear benefit (e.g., faster onboarding for a bank account or higher security), they tend to accept it – especially younger users, 77% of whom find biometrics more convenient than traditional methods according to surveys (Onfido launches the next generation of facial biometric technology | Onfido). Providers must navigate this by being transparent (explaining why they need a selfie) and providing recourse. Some users have also highlighted accessibility issues – for example, people who are blind or have low vision might not be able to center their ID or face in the frame without help. Liveness checks that require specific movements could be hard for those with limited mobility. Regulations (like the ADA in the U.S.) are beginning to consider these aspects; providers have started including accessibility features (such as voice instructions, ability to use keyboard instead of screen tapping, etc.).

In summary, modern camera-based identity verification services are highly accurate under ideal conditions – often above 95-99% success – but real-world factors can reduce those rates. Industry leaders like Persona, Onfido, Jumio have achieved low error rates through advanced AI and liveness checks, as evidenced by independent audits and certifications (Onfido’s Real Identity Platform Improves Performance by 12x | Onfido) (). However, studies and user feedback reveal that common failure points include poor image quality, suboptimal lighting, device variability, and stringent process rules. When comparing providers, it’s clear they all have had to address the FAR vs. FRR trade-off: some, like ID.me, initially prioritized fraud prevention (low FAR) at the expense of user experience (higher FRR ~10%+ requiring manual intervention) (A year after outcry, IRS still doesn’t offer taxpayers alternative to ID.me | CyberScoop). Others, like Onfido’s Motion or Jumio with iProov, strive for a better balance, leveraging tech that keeps both types of errors extremely low (sub-1%) so that most legitimate users sail through while stopping nearly all impostors (Onfido launches the next generation of facial biometric technology | Onfido).

Conclusion

Camera-based third-party ID verification has rapidly advanced in accuracy due to improved AI models, huge training datasets, and rigorous testing. The best systems today verify identities with minimal errors – e.g., false acceptance rates on the order of 0.01% and false rejection rates well under 1% in controlled settings (Onfido’s Real Identity Platform Improves Performance by 12x | Onfido) (Onfido launches the next generation of facial biometric technology | Onfido). They incorporate multi-faceted checks (document chip reading, face matching, liveness, fraud analytics) to boost reliability and security. Nonetheless, no system is foolproof. Environmental and human factors will cause some legitimate users to fail automated checks, which is why backup procedures (retries, manual review, alternative verification) are crucial for a fair and inclusive implementation. Industry benchmarks from NIST, GSA, and DHS show that while many vendors perform at a high level, there is significant variance – indicating organizations must carefully evaluate providers (perhaps even conduct a “bake-off” pilot test of their own) (Buyer’s Guide to Identity Verification Solutions | Persona) (Buyer’s Guide to Identity Verification Solutions | Persona) rather than trusting glossy marketing stats alone.

Looking forward, ongoing independent evaluations (like the upcoming peer-reviewed GSA report in 2025 (GSA testing finds variations in the accuracy of digital ID verification tech – Nextgov/FCW) and the planned DHS Remote Identity Validation Rally in 2025 (Understanding the results of DHS S&T’s RIVTD biometrics assessment | Biometric Update)) will shed more light on each service’s strengths and weaknesses. Providers that invest in usability, broad device support, and bias reduction are likely to stand out. Likewise, compliance with evolving standards (NIST 800-63-4, AI Act, etc.) will be a differentiator – accuracy isn’t just a technical goal but a regulatory expectation and ethical mandate. Users’ real-world experiences remind us that an identity verification system’s success is measured not just by percentages in a lab, but by its ability to handle the diversity of people and conditions out in the world. In that regard, the industry is moving in a positive direction: error rates continue to drop, and awareness of edge cases is growing. With multi-layered approaches (document + biometric + database checks) and human-in-the-loop fail-safes, camera-based ID verification services can achieve both high accuracy and robust reliability, enabling security without shutting out the honest users who just want to prove “I am who I claim to be.”


Reflections on Teaching Distributed Systems

From January through April 2023 I taught CPSC 416 at UBC. It was the first time I had taught this course, and it would not have been possible without the assistance I received from others, notably Ada Gavrilovska and Ivan Beschastnikh, both of whom allowed me to use their materials in creating my own course. Of course, I reordered things and adapted them to fit the class.

Teaching a course for the first time is always an illuminating experience. I’ve designed and taught classes on systems topics, including elements of DCE/DFS, the distributed file system on which I worked back when I was a twenty-something software developer, classes on Windows driver development (device drivers, file systems, and file system filter drivers), and Windows kernel debugging. I even explored utilizing online mechanisms for providing a non-linear educational approach to core OS concepts (processes, threads, scheduling, and synchronization) as part of my MSCS work at Georgia Tech. Thus, I have enjoyed engaging in education and looking for ways to do it better for much of my life, and I have always found new insights each time I teach a new course. CPSC 416 was no exception.

I have never taken a distributed systems class. I learned about distributed systems organically. In my senior year at the University of Chicago I worked for the nascent Computer Science department as part of the facilities team, and part of that work included networking. I remember soldering connectors together for the Ethernet connections of the time – vastly different from the RJ-45 connections we use now, but the same underlying technology. After graduation I took a job at Stanford working with David Cheriton, who ran the “Distributed Systems Group.” The V operating system, which his group developed, is what I used on my desktop, and I built a number of network components as part of my work for him, including a network protocol (VMTP) that I implemented on a BSD 4.2 UNIX-based system, diskless bootstrap drivers, and even an IP-multicast version of a multi-player Mazewar variant.

From Stanford I went to Transarc, a CMU-research-inspired start-up company that had two very different product directions: one was an online transaction processing system, and the other was a commercialization of the AFS distributed file system. I also worked on its successor (the DCE/DFS project I mentioned earlier).

Thus, my background in distributed systems was building distributed systems, often from the perspective of not knowing what I was doing but being surrounded by smart people who helped me figure it out. In some ways, that was my objective in teaching CPSC 416: paying that hard work forward.

One observation now: things that you learn in your twenties become “assumed knowledge” quite easily in your fifties. Thus, I just assumed that everyone knew how databases maintain consistency in the face of failures. This turns out not to be true. So, one of my first lessons here was that I need to explain this up front in order for many of the things I say afterwards to make sense. What is the point of replicating a log (what database people usually refer to as a journal) if you don’t understand that a log is the basic mechanism we use to restore consistency in a database? Lesson 1: teach people about databases and recoverability.
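To make Lesson 1 concrete, here is a minimal sketch (my own illustration, not course material) of the core idea: every update is appended to the log and flushed before the primary structure is modified, so replaying the log after a crash restores a consistent state. A real database log also records transaction boundaries, undo information, and forces the data to disk rather than just flushing a stream, all of which this toy omits.

```csharp
// Minimal write-ahead-log sketch (illustrative only): updates are appended to the
// log before they are applied, so replaying the log after a crash reconstructs a
// consistent key-value state.
using System;
using System.Collections.Generic;
using System.IO;

class MiniWal
{
    private readonly string _logPath;
    private readonly Dictionary<string, string> _state = new Dictionary<string, string>();

    public MiniWal(string logPath)
    {
        _logPath = logPath;
        Replay();                      // recovery: rebuild state from the log
    }

    public void Put(string key, string value)
    {
        // 1. Append the intent to the log (a real WAL would also force it to disk).
        using (var w = new StreamWriter(_logPath, append: true))
        {
            w.WriteLine($"PUT\t{key}\t{value}");
            w.Flush();
        }
        // 2. Only then mutate the in-memory (or on-disk) structure.
        _state[key] = value;
    }

    public string Get(string key) => _state.TryGetValue(key, out var v) ? v : null;

    private void Replay()
    {
        if (!File.Exists(_logPath)) return;
        foreach (var line in File.ReadLines(_logPath))
        {
            var parts = line.Split('\t');
            if (parts.Length == 3 && parts[0] == "PUT")
                _state[parts[1]] = parts[2];   // re-apply logged updates
        }
    }
}
```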

The second observation stems from my first oversight: building transactionally safe recoverable systems is hard. You’d think I’d know that, since I built a transactionally safe recoverable system back in the late 1980s and early 1990s as part of my work on Episode, the local physical file system that we used to support some of the nifty features of DCE/DFS. Episode, in some form, continues to be used in production today (file systems have unnaturally long lives if they get any serious adoption), and I would be quite surprised if the underlying transactional system were significantly different than it was thirty years ago when I worked on it. Lesson 2: teach people about building transactionally safe databases. Related to this is explaining key-value stores explicitly. They are a key part of the programming assignments, and understanding why we use them helps. They typically form the basis of databases and file systems. Indeed, file systems are typically key-value stores with a name space built on top of them: the keys are limited (integers representing an entry in a potentially sparse table of objects) and the values are mutable (which complicates implementation and correctness).
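A tiny sketch of that framing (again, purely illustrative): an “object table” maps small integer keys to mutable values, and a separate namespace maps human-readable names onto those keys, which is essentially what directories do in a file system.

```csharp
// Illustrative sketch: a file system viewed as a key-value store (object table)
// plus a namespace layered on top of it.
using System;
using System.Collections.Generic;

class ObjectStore
{
    // Keys are small integers (think inode numbers); values are mutable blobs.
    private readonly Dictionary<ulong, byte[]> _objects = new Dictionary<ulong, byte[]>();
    private ulong _nextId = 1;

    public ulong Allocate(byte[] data) { var id = _nextId++; _objects[id] = data; return id; }
    public byte[] Read(ulong id) => _objects[id];
    public void Write(ulong id, byte[] data) => _objects[id] = data;   // mutation is what complicates recovery
}

class NameSpace
{
    // The namespace maps paths to keys in the object store, as directories do.
    private readonly Dictionary<string, ulong> _names = new Dictionary<string, ulong>();
    private readonly ObjectStore _store;

    public NameSpace(ObjectStore store) => _store = store;

    public void Create(string path, byte[] data) => _names[path] = _store.Allocate(data);
    public byte[] Open(string path) => _store.Read(_names[path]);
}
```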

The third observation is not an original one, as I have learned in conversations with other educators. There is a fundamental misalignment between the objectives of students (which amount to grade maximization) and my objectives (which amount to “learn useful stuff that will help you throughout your career”). This isn’t a big surprise – after all, I have published work in plagiarism reduction – but trying to find ways to fix it is challenging. I had not expected the insane amount of pressure students seem to feel to maximize their grades. I did try to mitigate this somewhat by offering extra credit opportunities, though in the end the push to exploit those opportunities seemed to create stress for many. Why people who are getting grades in the 90%+ range are “afraid of failing” is beyond me. Lesson 3: extra credit creates more stress than it alleviates. I don’t think this is entirely the case, but I’ll be more cautious about using it in the future. Still, I don’t want people to worry that they are going to fail, so my thought is to provide an incentive to participate that mitigates the likelihood of them failing.

My fourth observation is that I had never read the various papers about distributed consensus side by side before. Doing so was an eye-opening experience. What I learned is that in many cases the complications in those papers relate to: (1) optimizations; and (2) recovery. Thus, next time I teach this class I want to spend more time walking through the baseline protocol and then pointing out the optimizations and the handling of recovery. One example of this was when a student pointed out that the Paxos Made Moderately Complex paper (PMMC) states that during the leader election phase the voting party sends along a list of their accepted but not committed proposals (from the previous leadership). This is not part of the protocol. It is an optimization that makes recovery faster and more efficient, but you can’t rely upon it to maintain correctness. Now that I understand this point of confusion better, I think I can walk people through it and distinguish the base protocol from the optimization. Doing so will help people understand the underlying protocol better and then the optimizations we use to ensure it works correctly. Lesson 4: walk through the papers more carefully, explaining the base protocol and then pointing out that the primary differences are in optimizations and recovery mechanisms.

My fifth observation is that students focus too much on code and not enough on understanding. Distributed systems is an area in which you must think through failure cases, identify how you will handle them, and decide what you assume will remain true (your “invariants”) throughout your code base. I did introduce some tools for doing this (modeling, and TLA+ specifically), but I did not incorporate them into the actual assignments. I did have students write reports, but those were post-hoc reports. I would like to try making the design cycle a more prominent portion of this work, encouraging people to think about what they are building rather than trying to hack their way through it. One piece of feedback from several students was that my advice to “walk away and think through the project” was quite helpful. I’d like the structure of the course to make that happen more naturally. I also think that having explicit design milestones would reduce stress by encouraging students to work on the projects before the deadline. Lesson 5: design is more important than code, but code helps students verify that their design reflects good understanding. The challenge will be in finding the right balance between the two.

I have other, smaller observations as well that I won’t break out but I’ll capture here:

  • Extensions don’t really help.
  • Providing a flexible late policy can be helpful, but it often creates quite a lot of stress.
  • Some students abhor teams, some like them. I need to find a way to accommodate both learning styles in a way that is equitable.
  • Do as much as possible to simplify grading exams.
  • Make a conscious effort after each lesson to create questions for the exam. I provided individualized exams (drawn from a pool of questions) and I wish I’d had more questions from which to draw. What was particularly nice was being able to provide people with their own personalized exam’s answers right at the end of the exam. It also helped me identify some issues.

I do think a number of steps that I took in this course worked well. People (generally) liked the failure examples, they liked the responsiveness, and they liked the material. It is easy to focus on just the negatives, but I want to make sure to acknowledge the positives because it is important to preserve those elements.

Finally, I have agreed to teach this course again in the fall (which for UBC means “Winter Term 1”) so I will have an opportunity to incorporate what I have learned into the next course offering. I’m sure I’ll have more insights after that class.

Challenges of Capturing System Activity

A key aspect of the work I am doing for Indaleko is to “capture system activity” so that it can be used to form “activity contexts,” which can then inform the process of finding relevant information. As part of that, I have been working through the work of Daniela Vianna. While I have high-level descriptions of the information she collected and used, I need to reconstruct this. She collects data from a variety of sources. The most common source of information is web APIs to services such as Google and Facebook; in addition, she also uses file system activity information.

Since my background is file systems, I decided to start on the file system activity front first. Given that I’ve been working with Windows for three decades now, I decided to leverage my understanding of Windows file systems to collect such information. One nice feature of the NTFS file system on Windows is its support for a form of activity log known as the “USN Journal.” Of course, one of my handicaps is that I am used to using the native operating system API, not the libraries that are implemented on top of it. This is because when building file systems on Windows I have always been interested in testing the full kernel file systems interface. While there are a few specific features that cannot be exercised with just applications, there are still a number of interfaces that cannot be tested using the typical Win32 API that can be tested using the native API. In recent years the number of features that have been hidden from the Win32 API has continued to decrease, which has diminished the need to use the native API. I just haven’t had any strong need to learn the Win32 API – why start now?

I decided the model I want to use is a service that pulls data from the USN journal and converts it into a format suitable for storing in a MongoDB database. I decided to go with Mongo because that is what Vianna used for her work. The choice at this point is somewhat arbitrary but MongoDB makes sense because it tends to work well with semi-structured data, which is what I will be handling.
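To make the model concrete, here is a minimal sketch of the kind of sink I have in mind, using the official MongoDB C# driver. The database name, collection name, and field names are placeholders I made up for illustration, not necessarily what the project will end up using.

```csharp
// Sketch of the intended flow: take a decoded USN record and store it as a
// semi-structured document in MongoDB. The names ("indaleko", "usn_activity",
// field names) are placeholders for illustration.
using System;
using MongoDB.Bson;
using MongoDB.Driver;

public static class UsnActivitySink
{
    private static readonly IMongoCollection<BsonDocument> Collection =
        new MongoClient("mongodb://localhost:27017")
            .GetDatabase("indaleko")
            .GetCollection<BsonDocument>("usn_activity");

    public static void Record(string volume, ulong fileReferenceNumber,
                              long usn, string fileName, uint reason, DateTime timestampUtc)
    {
        var doc = new BsonDocument
        {
            { "volume", volume },
            { "frn", (long)fileReferenceNumber },   // file reference number from the USN record
            { "usn", usn },
            { "fileName", fileName },
            { "reason", (int)reason },              // USN_REASON_* bit mask
            { "timestamp", timestampUtc }
        };
        Collection.InsertOne(doc);
    }
}
```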

Similarly, I decided that I’d write my service for pulling USN Journal data from the NTFS file system(s) in C#, since I have written some C# in the past, it makes some of the higher-level tasks I have much easier, and it is well supported on Windows. I have made my repository public, though I may restructure and/or rename it at some point (currently I call it CSharpToNativeTest because I was trying to invoke the native API as unmanaged code from C#). The most common approach to this is to use the “P/Invoke” mechanism, but after a bit of trial and error I decided I wanted something that would be easier for me to debug, so instead of pulling the native routine directly from ntdll.dll I load it from my own DLL (written in C), which then invokes the real native call. This allows me to see how data is being marshaled and delivered to the C language wrapper. I also tried to make the native API “more C# friendly.” I am sure it could be more efficient, but I wanted to support a model that I could extend, and hopefully it will be easier to make it more efficient should that prove necessary.
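The pattern looks roughly like the following sketch. The names and the signature here are hypothetical (the real repository differs); the point is the indirection through my own C DLL rather than a direct P/Invoke into ntdll.dll.

```csharp
// Sketch of the indirection described above (illustrative; the actual code in
// CSharpToNativeTest may differ). Instead of P/Invoking ntdll.dll directly, the
// C# code calls a small C wrapper DLL, which forwards to the native API.
//
// Hypothetical C side (NativeWrapper.dll), shown only as a comment:
//   __declspec(dllexport) LONG WrapNtQueryVolumeInformationFile(
//       HANDLE volume, void *buffer, ULONG length, ULONG infoClass);
// The wrapper can log or inspect the marshaled arguments before calling
// NtQueryVolumeInformationFile, which makes debugging the marshaling easier.

using System;
using System.Runtime.InteropServices;

internal static class NativeWrapper
{
    // Name and signature are hypothetical; they only illustrate the pattern.
    [DllImport("NativeWrapper.dll", ExactSpelling = true)]
    internal static extern int WrapNtQueryVolumeInformationFile(
        IntPtr volumeHandle,
        IntPtr buffer,
        uint bufferLength,
        uint fsInformationClass);
}
```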

One thing I did was to script the conversion of all the status values in ntstatus.h into a big C# enum type. The benefit is that when debugging I can automatically see the mnemonic name of each status code as well as its numeric value. I then built the layer needed to map between the various volume names used on Windows: device names, device IDs, and symbolic links (drive letters). While I have not yet added it, I wrote things so that it should be fairly straightforward to add a background thread that wakes up when devices arrive or disappear. As I have noted before, “naming is hard.” This is just one more example of the flexibility and challenges of aliasing and naming.
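As an illustration of the enum conversion (the generated file in the repository is far larger, and the member names here are my own), the result looks something like this, using the standard NTSTATUS values from ntstatus.h:

```csharp
// Illustrative excerpt of a generated NTSTATUS enum; the numeric values are
// the standard codes from ntstatus.h.
public enum NtStatus : uint
{
    Success = 0x00000000,          // STATUS_SUCCESS
    BufferOverflow = 0x80000005,   // STATUS_BUFFER_OVERFLOW
    InvalidParameter = 0xC000000D, // STATUS_INVALID_PARAMETER
    AccessDenied = 0xC0000022,     // STATUS_ACCESS_DENIED
}

// With the enum in place, a raw status value shows up with a readable name
// in the debugger rather than as a bare hex number:
// var status = (NtStatus)0xC0000022; // displays as "AccessDenied"
```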

Finally, I turned my attention to the USN journal. I found some packages for decoding USN journal entries; most were written to parse the data directly from the drive, while a few managed dynamic access. Since I want this to be a service that monitors the USN journal and keeps adding information to the database, I decided to write C# code that uses the API for retrieving that information. At this point, what I have is the ability to scan all the volumes on the machine – even those without drive letters – and query them to see if they support a USN journal. I do this properly: I query the file system attributes (using the NtQueryVolumeInformationFile native API) and check whether the bit indicating USN journal support is set. I do not rely on the file system name, an approach I’ve always considered a hack, especially since I have been in the habit of writing file systems that support NTFS features, including named data streams, extended attributes, and object IDs. In fact, the ReFS file system on Windows also supports USN journals, so I’m not just being my usual pedantic developer self in this instance.
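A sketch of that check, assuming the raw attribute flags have already been retrieved via NtQueryVolumeInformationFile (the helper and type names here are hypothetical; the flag value is the standard FILE_SUPPORTS_USN_JOURNAL bit from the Windows SDK headers):

```csharp
using System;

// Checks the capability bit rather than comparing the file system name
// string ("NTFS"), so ReFS and other journal-capable file systems are
// handled correctly.
[Flags]
public enum FileSystemAttributeFlags : uint
{
    // Value taken from winnt.h (FILE_SUPPORTS_USN_JOURNAL).
    SupportsUsnJournal = 0x02000000,
}

public static class VolumeCapabilities
{
    public static bool SupportsUsnJournal(uint fileSystemAttributes)
    {
        return (fileSystemAttributes &
                (uint)FileSystemAttributeFlags.SupportsUsnJournal) != 0;
    }
}
```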

At this point, I am able to identify volumes that support USN journals, open them, and find out if the journal is turned on (it is by default on the system volume, which is almost always the “C:” drive, though I enjoy watching things break when I configure a system to use some other drive letter). I then extract the information and convert it into in-memory records. At the moment I just have it wait a few seconds and pull the newest records, but my plan is to evolve this into a service that keeps pulling data and pushing it into my MongoDB instance.
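The polling model I have in mind looks roughly like the following sketch; ReadNewUsnRecords stands in for the real FSCTL-based journal read, and the record type is deliberately simplified:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Rough sketch of the polling loop: remember the last USN seen, sleep
// briefly, then ask the journal for anything newer and hand it off for
// storage. The journal-reading delegate is a stand-in for the real call.
public sealed class UsnJournalPoller
{
    private readonly Func<ulong, IReadOnlyList<UsnRecord>> _readNewUsnRecords;
    private ulong _lastUsn;

    public UsnJournalPoller(Func<ulong, IReadOnlyList<UsnRecord>> readNewUsnRecords)
    {
        _readNewUsnRecords = readNewUsnRecords;
    }

    public void Run(Action<UsnRecord> publish, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            foreach (var record in _readNewUsnRecords(_lastUsn))
            {
                publish(record);               // e.g., queue for database insertion
                _lastUsn = Math.Max(_lastUsn, record.Usn);
            }
            Thread.Sleep(TimeSpan.FromSeconds(5)); // "wait a few seconds"
        }
    }
}

// Simplified in-memory representation of a journal entry.
public sealed class UsnRecord
{
    public ulong Usn { get; set; }
    public string FileName { get; set; }
    public DateTime Timestamp { get; set; }
    public uint Reason { get; set; }
}
```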

At this point, I realized I do not really know that much about MongoDB so I have decided to start learning a bit more about it. Of course, I don’t want to be a MongoDB expert, so I also have been looking more carefully at Daniela Vianna’s work, trying to figure out what her data might have looked like and think about how I’m going to merge what she did into what I am doing. This is actually exciting because it means I’m starting to think of what we can do with this additional information.
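For the MongoDB side, a minimal sketch using the official C# driver might look like this; the database and collection names are placeholders (not anything from the actual project), and the record shape reuses the simplified UsnRecord from the sketch above:

```csharp
using MongoDB.Bson;
using MongoDB.Driver;

// Pushes one converted USN record into MongoDB as a document. Semi-structured
// documents work well here: fields can be added as more activity sources come
// online without schema migrations.
public static class UsnRecordStore
{
    public static void Insert(UsnRecord record)
    {
        var client = new MongoClient("mongodb://localhost:27017");
        var collection = client.GetDatabase("indaleko")
                               .GetCollection<BsonDocument>("usn_activity");

        var document = new BsonDocument
        {
            { "usn", (long)record.Usn },
            { "fileName", record.FileName },
            { "timestamp", record.Timestamp },
            { "reason", (int)record.Reason },
        };
        collection.InsertOne(document);
    }
}
```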

This afternoon I had a great conversation with one of my PhD supervisors about this, and she made a couple of suggestions about ways to consume this data. That she was suggesting things I’d also added to my list was encouraging. Here is what we are thinking:

  • We can consider using “learned index structures” as we begin to build up data sets.
  • We can use techniques such as Google BERT to facilitate dealing with the API data that Vianna’s work used. I noted that the challenges Vianna identified with APIs are similar to those of languages: they have meaning, and those meanings can be expressed in multiple ways.
  • The need for being able to efficiently find things is growing rapidly. She was explaining some work that indicates our rate of data growth is outstripping our silicon capabilities. In other words, there is a point at which “brute force search” becomes impractical. I liked this because it suggests what we are seeing with our own personal data is a leading indicator of the larger problem. This idea of storing the meta-data independent of the data is a natural one in a world where the raw information is too abundant for us to just go looking for an item of interest.

So, my work continues, mostly mundane and boring, but there are some useful observations even at this early stage. Now to figure out what I want the data in my database to look like and start storing information there. Then I can go figure out what I did right, what I did wrong, and how to improve things.

Aside: one interesting aspect of the BERT work was their discussion of “transducers.” This reminded me of Gifford’s Semantic File System work, where he used transducers to suck out semantic information from existing files.

Brainiattic: Remember more with your own Metaverse enhanced brain attic

Connecting devices and human cognition

I recently described the idea of “activity context” and suggested that providing this new type of information about data (meta-data) to applications would improve important tasks such as finding. My examining committee challenged me to think about what I would do if my proposed service – Indaleko – already existed today.

This is the second idea that I decided to propose on my blog. My goal is to explore how activity context can be used to provide enhanced functionality. My first idea was fairly mundane: improving the “file browsing” experience in a fashion that focuses on content and similarity, by combining prior work with the additional insight provided by activity context.

My initial motivation for this second idea was my mental image of a personal library, but I note that there’s a more general model here: displaying digital objects as something familiar. When I recently described this library instantiation of my brain attic, the listener said, “but I don’t think of digital objects as being big enough to be books.” To address this point: I agree; another person’s mental model for how they want to represent digital data in a virtual world need not match mine. That’s one of the benefits of virtual worlds – we can represent things in forms that are not constrained by what they must be in the real world.

In my recent post about file browsers I discussed Focus, an alternative “table top” browser for making data accessible. One reason I liked Focus is that the authors observed how hierarchical organization does not work in this interface, yet they also showed that the interface is useful – a concrete argument for at least one limitation of the hierarchical file/folder browser model. Another important aspect of the Focus work was their observation that the table top interface permits different users to organize information in their own way. A benefit of a virtual “library” is that the same data can be presented to different users in ways that are comfortable to them.

Of course, the “Metaverse” is still an emerging set of ideas. In a recent article about Second Life, Philip Rosedale points out that existing advertising-driven models don’t work well. This raises the question: what does work well?

My idea is that by having a richer set of environmental information available, it will be easier to construct virtual models that we can use to find information. Vannevar Bush had Memex, his extended memory tool. This idea turns out to be surprisingly ancient in origin, from a time before printing when most information was remembered. I was discussing this with a fellow researcher and he suggested this is like Sherlock Holmes’ Mind Palace. This led me to the model of a “brain attic” and I realized that this is similar to my model of a “personal virtual library.”

The Sherlock Holmes article has a brilliant quotation from Maria Konnikova: “The key insight from the brain attic is that you’re only going to be able to remember something, and you can only really say you know it, if you can access it when you need it.”

This resonates with my goal of improving finding, because improving finding improves access when you need it.

Thus, I decided to call this mental model “Brainiattic.” It is certainly more general than my original mental model of a “personal virtual library,” yet I am still permitted to have my pertinent digital objects projected as books. I could then ask my personal digital librarian to show me works related to specific musical bands, or particular weather. As our virtual worlds become more capable – more like the holodeck of Star Trek – I can envision having control of the ambient room temperature and even the production of familiar smells. While our smart thermostats now capture ambient room temperature and humidity, and we can query online sources for external temperatures, we don’t actively use that information to inform our finding activities, despite the reality that human brains do recall such things: “it was cold out,” “I was listening to Beethoven,” or “I was sick that day.”

Thus, additional contextual information can be used, at a minimum, to improve finding by enabling your “brain attic.” I suspect that, once activity context is available, we will find additional ways to use it in constructing our personal metaverse environments.

Using Focus, Relationship, Breadcrumbs, and Trails for Success in Finding

As I mentioned in my last post, I am considering how to add activity context as a system service that can be useful in improving finding. Last month (December 2021) my examination committee asked me to consider a useful question: “If this service already existed, what would you build using it?”

The challenge in answering this question was not finding examples, but rather finding examples that fit into the “this is a systems problem” box that I had been thinking about while framing my research proposal. It has now been a month and I realized at some point that I do not need to constrain myself to systems. From that, I was able to pull a number of examples that I had considered while writing my thesis proposal.

The first of these is likely what I would consider the closest to being “systems related.” This hearkens back to the original motivation for my research direction: I was taking Dr. David Joyner’s “Human-Computer Interaction” course at Georgia Tech, and at one point he used the “file/folder” metaphor as an example of HCI. I had been wrestling with the problem of scope and finding, and this simple presentation made it clear why we were not escaping the file/folder metaphor – it has been “good enough” for decades.

More recently, I have been working on figuring out better ways to encourage finding, and that is the original motivation for my thesis proposal. The key idea of “activity context” has potentially broader usage beyond building better search tools.

In my research I have learned that humans do not like to search unless they have no other option. Instead, they prefer to navigate. The research literature says that this is because searching creates more cognitive load for the human user than navigation does. I think of this as meaning that people prefer to be told where to go rather than being given a list of possible options.

Several years ago (pre-pandemic) Ashish Nair came and worked with us for nine weeks one summer. I worked with him to look at building tools that take existing file data across multiple distinct storage domains and present it based upon commonality. By clustering files according to both their meta-data and easily extracted semantic context, he was able to modify an existing graph data visualizer to permit browsing files based on those relationships, regardless of where they were actually stored. While simple, this demonstration has stuck with me.

Ashish Nair (Systopia Intern) worked with us to build an interesting file browser using a graph data visualizer.

Thus, pushed to think of ways in which I would use Indaleko, my proposed activity context system, it occurred to me that using activity context to cluster related objects would be a natural way to exploit this information. It is also relatively easy to achieve. Unlike some of my other ideas, this is a tool that can demonstrate an associative model, because “walking a graph” is an easy-to-understand way to explore related information.
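To illustrate the idea (this is a toy sketch, not the Indaleko design), clustering by shared activity-context tags and then walking from a focus file to its nearest neighbors could look something like this:

```csharp
using System.Collections.Generic;
using System.Linq;

// A toy "focus and neighbors" navigation model: given a focus file, rank the
// other files by how many activity-context tags they share with it, then let
// the user re-focus on one of those neighbors.
public sealed class ActivityGraphBrowser
{
    // Maps a file identifier to the activity-context tags associated with it
    // (e.g., "rainy", "listening-to-jazz", "video-call").
    private readonly Dictionary<string, HashSet<string>> _contextTags;

    public ActivityGraphBrowser(Dictionary<string, HashSet<string>> contextTags)
    {
        _contextTags = contextTags;
    }

    public IEnumerable<string> Neighbors(string focusFile, int count = 8)
    {
        var focusTags = _contextTags[focusFile];
        return _contextTags
            .Where(kv => kv.Key != focusFile)
            .OrderByDescending(kv => kv.Value.Intersect(focusTags).Count())
            .Take(count)
            .Select(kv => kv.Key);
    }
}
```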

There is a small body of research that has looked at similar interfaces. One that stuck in my mind was called Focus. While the authors were thinking of tabletop interfaces, the basic paradigm they describe – one starts with a “primary file” (the focus) and then sees similar files (driven by content and meta-data) along the edges – is remarkably like Ashish’s demo.

The exciting thing about having activity context is that it provides interesting new ways of associating files: independent of location and clustered by commonality. Both the demo and Focus use existing file meta-data and content similarity, which is useful. With activity context added as well, there is further information that can be used both to refine similar associations and to cluster along a greater number of axes.

Thus, I can show off the benefits of Indaleko‘s activity context support by using a Focus-style file browser.

Better Finding: Combine Semantic and Associative Context with Indaleko

Last month I presented my thesis proposal to my PhD committee. Presenting my proposal doesn’t mean that I am done; rather, it means that I have more clearly identified what I intend to make the focus of my final research.

It has certainly taken longer to get to this point than I had anticipated. Part of the challenge is that there is quite a lot of work that has been done previously around search and semantic context. Very recent work by Daniela Vianna relates to the use of “personal digital traces” to augment search. It was Dr. Vianna’s work that provided a solid theoretical basis for my own proposed work.

Our computer systems collect quite an array of information, not only about us but also about the environment in which we work.

In 1945 Vannevar Bush described the challenges to humans of finding things in a codified system of records. His observations continue to be insightful more than 75 years later:

Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.

I find myself returning to Bush’s observations. They have led me to ask whether it is possible for us to build systems that get us closer to this ideal.

My thesis is that collecting, storing, and disseminating information about the environment in which digital objects are being used provides us with new context that enables better finding.

So, my proposal is about how to collect, store, and disseminate this type of external contextual information. I envision combining it with existing data sources and indexing mechanisms to capture the activity context in which digital objects are used by humans. A systems-level service that can do this will then enable a broad range of applications to exploit this information to reconstruct context that is helpful to human users. Over my next several blog posts I will describe some ideas about what I envision being possible with this new service.

The title of my proposal is: Indaleko: Using System Activity Context to Improve Finding. One of the key ideas from this is the idea that we can collect information the computer might not find particularly relevant but the human user will. This could be something as simple as the ambient noise in the user’s background (“what music are you listening to?” or “Is your dog barking in the background”) or environmental events (“it is raining”) or even personal events (“my heart rate was elevated” or “I just bought a new yoga mat”). Humans associate things together – not in the same way, nor the same specific elements – using a variety of contextual mechanisms. My objective is to enable capturing data that we can then use to replicate this “associative thinking” that helps humans.
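For illustration only, a single activity-context record of the kind described above might have a shape like the following; none of these field names come from the actual proposal, they simply show the sort of ambient signals involved:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of one activity-context record: the digital object being
// used, when it was used, and the ambient signals surrounding that use.
public sealed class ActivityContextRecord
{
    public Guid ObjectId { get; set; }          // the digital object in use
    public DateTimeOffset Timestamp { get; set; }
    public string MusicTrack { get; set; }      // "what music are you listening to?"
    public string Weather { get; set; }         // "it is raining"
    public int? HeartRate { get; set; }         // "my heart rate was elevated"
    public List<string> NearbyDevices { get; set; } = new List<string>();
}
```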

Ultimately, such a system will help human users find connections between objects. My focus is on storage because that is my background: in essence, I am interested in how the computer can extend human memory without losing the amazing flexibility of that memory to connect seemingly unrelated “things” together.

In my next several posts I will explore potential uses for Indaleko.

“…the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.” This is as true in 2021 as it was in 1945. Thus, the question that motivates my research is: “Can we build systems that get us closer to that ideal?”

Laundry Baskets: The New File System Namespace Model

A large pile of laundry in a laundry basket, with a cat sleeping on the top.
The “Laundry Basket” model for storage.

While I’ve been quiet about what I’ve been doing research-wise, I have been making forward progress. Only recently have ideas been converging towards a concrete thesis and the corresponding research questions that I need to explore as part of verifying my thesis.

I received an interesting article today showing that my research is far more relevant than I’d considered: “FILE NOT FOUND”. The article describes how the predominant organizational scheme for “Gen Z” students is the “laundry basket,” into which all of their files are placed. This comes as a surprise to people who have been trained in the ways of the hierarchical folder metaphor.

While going through older work, I have found it intriguing that early researchers did not see the hierarchical design as the pinnacle of file organization; rather, they saw it as a stop-gap measure on the way to richer models. Researchers have indeed explored richer models. Jeff Mogul, now at Google Research, did his PhD thesis around various ideas for improving file organization. Eno Thereska, now at Amazon, wrote an intriguing paper while at Microsoft Research entitled “Beyond file systems: understanding the nature of places where people store their data,” in which he and his team pointed out the tension that cloud storage was creating with traditional file systems. The article from the Verge that prompted me to write this post makes sense in the context of what Thereska was saying back in 2014.

The challenge is to figure out what comes instead. Two summers ago I was fortunate enough to have a very talented young intern working with me for a couple of months. One of the interesting things he built during that time was a tool that viewed files as a graph rather than a tree. The focus was always at the center, surrounded by related files. Pick one of those files and it became the central focus, with a breadcrumb trail showing how you got there but also showing other related files.

The relationships we used were fairly simple and extracted from existing file meta-data. What was quite fascinating, though, was that we constructed it to tie two disjoint storage locations (his local laptop and his Google Drive) together into a single namespace. It was an electrifying demonstration, and I have been working to figure out how to enable it more fully – what we had was a mock-up with static information, but the visualization aspects of “navigating” through files were quite powerful.

I have been writing my thesis proposal, and as part of that I’ve been working through and identifying key work that has already been done. My goal, of course, is to build on top of this prior work, and while I have identified ways of doing so, I also see that to be truly effective it should reuse as much of that prior work as possible. The idea of not having directories is a surprisingly powerful one. What I hadn’t heard previously was the idea of considering it a “laundry basket,” yet the metaphor is quite apt. Thus, the question is how to build tools that find the specific thing you want in the basket as quickly as possible.

For example, the author of the Verge article observed: “More broadly, directory structure connotes physical placement — the idea that a file stored on a computer is located somewhere on that computer, in a specific and discrete location.” Here is what I recently wrote in an early draft of my thesis proposal: “This work proposes to develop a model to separate naming from location, which enables the construction of dynamic cross-silo human usable name-spaces and show how that model extends the utility of computer storage to better meet the needs of human users.”

Naming tied to location is broken, at least for human users. Oh, sure, we need to keep track of where something is stored to actually retrieve the contents, but there is absolutely no reason that we need to embed that within the name we use to find that file. One reason for this is that we often choose the location due to external factors. For example, we might use cloud storage for sharing specific content with others. People that work with large data sets often use storage locations that are tuned to the needs of that particular data set. There is, however, no reason why you should store the Excel spreadsheet or Python notebook that you used to analyze that data in the same location. Right now, with hierarchical names, you need to do so in order to put them into the “right directory” with each other.

That’s just broken.

However, it’s also broken to expect human users to do the grunt work here. The reason Gen Z is using a “laundry basket” is that it doesn’t require any effort on their part to put something into the basket. The work comes later, when they need to find a particular item.

This isn’t a new idea. Vannevar Bush described this idea in 1945:

“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

He also did a good job of explaining why indexing (the basis of hierarchical file systems) was broken:

“Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

“The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.”

We knew it was broken in 1945; what we’ve been doing since then is using what we’ve been given and making it work as best we can. Seltzer wrote “Hierarchical File Systems Are Dead” back in 2009. Yet that’s what computers still serve up as our primary interface.

The question then is what the right primary interface is. While I find that question interesting, I work with computer systems, so I am equally concerned with how we can use the vast amount of data in modern computer systems to build better support – better tools for navigating through the laundry basket to find the correct thing.

How I think that should be done will have to wait for another post, since that’s the point of my thesis proposal.

Where has the time gone?

It’s been more than a year since I last posted; it’s not that I haven’t been busy, but rather that I’ve been trying to do too many things and have been (more slowly than I’d like) cutting back on some of my activities.

Still, I miss using this as a (one-way) discussion about my own work. In the past year I’ve managed to publish one new (short) paper, though the amount of work I put into it was substantial (it was just published in Computer Architecture Letters). This short-article (letter) journal normally provides at most one revise-and-resubmit opportunity, but they gave me two, then accepted the paper, albeit begrudgingly over the objections of Reviewer #2 (who agreed to accept it but didn’t change their comments).

Despite the lack of clear publications to demonstrate forward progress, I’ve been working on a couple of projects to push them along. Both were presented, in some form, at Eurosys as posters.

Since I got back from a three-month stint at Microsoft Research (in the UK) I’ve been working on one of those, evolving the idea of kernel bypasses and really analyzing why we keep doing these things – this time through the lens of building user-mode file systems. I really should write more about it, since it is on the drawing board for submission this fall.

The second idea is one that stemmed from my attendance at SOSP 2019. There were three papers that spoke directly to file systems:

Each of these had important insights into the crossover between file systems and persistent memory. One of the struggles I had with that short paper was explaining to people “why file systems are necessary for using persistent memory”. I was still able to capture some of what I’d learned, but a fair bit of it was sacrificed to adding background information.

One key observation was around the size of memory pages and their impact on performance; it convinced me that we’d benefit from using ever larger page sizes for PMEM. Some of this is because persistent memory is, well, persistent and thus we don’t need to “load the contents from storage”. Instead, it is storage. So, we’re off testing out some ideas in this area to see if we can contribute some additional insight.

The other area – the one that I have been ignoring too long – is the thesis of this PhD work in the first place. Part of the challenge is to reduce the problem down to something that is tractable and can be finished in a reasonable amount of time.

Memex

One of the questions (and the one I wanted to explore when I started writing this) comes from a rather famous article from 1945 entitled As We May Think. Vannevar Bush described something quite understandable, yet we have not achieved it, though we have been trying – one could argue that hypertext stems from these ideas, but I would argue that hypertext links are a pale imitation of the rich assistive model Bush lays out when he describes the Memex.

Thus, to the question, which I will reserve for another day: why have we not achieved this yet? What prevents us from having this, or something better, and how can I move us towards this goal?

I suspect, but am not certain, that one culprit may be the fact we decided to stick with an existing and well-understood model of organization:

Maybe the model is wrong when the data doesn’t fit?